NYPD Complaint Data

0. Context: NYPD Complaint Data

Description of the Project and Expected Benefits

  • Using CARTO, we will explore and visualize the New York City Police Department complaint data and enrich it using spatial data.
  • Analyse the profiles of the victims and suspects (segmenting them into clusters).
  • Analyse the types of offenses.
  • Compare the data across genders.
  • Identify which zones of New York are more dangerous and where more police presence would make citizens safer.

For the development of this project, we'll be following the Analytics workflow seen in class:

analytics_workflow.png

0.1 Dataset

Dataset including all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) from 2018 to 2020.

First of all, we will show the definitions of felony, misdemeanor and violation for a better understanding of the project:

  • Misdemeanor: Offenses lower than felonies and generally punishable by fine, penalty, forfeiture, or imprisonment other than in a penitentiary. Under federal law and most state laws, any offense other than a felony is classified as a misdemeanor. Certain states also have various classes of misdemeanors.

  • Felony: A serious crime, characterized under federal law and many state statutes as any offense punishable by death or imprisonment in excess of one year. Crimes classified as felonies include, among others, treason, arson, murder, rape, robbery, burglary, manslaughter, and kidnapping.

  • Violation: In New York, an offense other than a traffic infraction for which a sentence of no more than fifteen days of imprisonment may be imposed; less serious than a misdemeanor.

  • Borough: Division of New York City.

Columns (Name: Description - Type):

  1. CMPLNT_NUM: Randomly generated persistent ID for each complaint - Number
  2. ADDR_PCT_CD: The precinct in which the incident occurred - Number
  3. BORO_NM: The name of the borough in which the incident occurred - Plain Text
  4. CMPLNT_FR_DT: Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists) - Date & Time
  5. CMPLNT_FR_TM: Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists) - Plain Text
  6. CMPLNT_TO_DT: Ending date of occurrence for the reported event, if exact time of occurrence is unknown - Date & Time
  7. CMPLNT_TO_TM: Ending time of occurrence for the reported event, if exact time of occurrence is unknown - Plain Text
  8. CRM_ATPT_CPTD_CD: Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely - Plain Text
  9. HADEVELOPT: Name of NYCHA housing development of occurrence, if applicable - Plain Text
  10. HOUSING_PSA: Development Level Code - Number
  11. JURISDICTION_CODE: Jurisdiction responsible for incident. Either internal, like Police(0), Transit(1), and Housing(2); or external(3), like Correction, Port Authority, etc. - Number
  12. JURIS_DESC: Description of the jurisdiction code - Plain Text
  13. KY_CD: Three digit offense classification code - Number
  14. LAW_CAT_CD: Level of offense: felony, misdemeanor, violation - Plain Text
  15. LOC_OF_OCCUR_DESC: Specific location of occurrence in or around the premises; inside, opposite of, front of, rear of - Plain Text
  16. OFNS_DESC: Description of offense corresponding with key code - Plain Text
  17. PARKS_NM: Name of NYC park, playground or greenspace of occurrence, if applicable (state parks are not included) - Plain Text
  18. PATROL_BORO: The name of the patrol borough in which the incident occurred - Plain Text
  19. PD_CD: Three digit internal classification code (more granular than Key Code) - Number
  20. PD_DESC: Description of internal classification corresponding with PD code (more granular than Offense Description) - Plain Text
  21. PREM_TYP_DESC: Specific description of premises; grocery store, residence, street, etc. - Plain Text
  22. RPT_DT: Date event was reported to police - Date & Time
  23. STATION_NAME: Transit station name - Plain Text
  24. SUSP_AGE_GROUP: Suspect’s Age Group - Plain Text
  25. SUSP_RACE: Suspect’s Race Description - Plain Text
  26. SUSP_SEX: Suspect’s Sex Description - Plain Text
  27. TRANSIT_DISTRICT: Transit district in which the offense occurred. - Number
  28. VIC_AGE_GROUP: Victim’s Age Group - Plain Text
  29. VIC_RACE: Victim’s Race Description - Plain Text
  30. VIC_SEX : Victim’s Sex Description - Plain Text
  31. X_COORD_CD: X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) - Number
  32. Y_COORD_CD: Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) - Number
  33. Latitude: Midblock Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) - Number
  34. Longitude: Midblock Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) - Number
  35. Lat_Lon: Location
  36. New Georeferenced Column: Point

0.2 Imports

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt 
from sklearn.cluster import KMeans
import scipy.cluster.hierarchy as sch

# Set Pandas to show all the columns
pd.set_option('display.max_columns', None)

1. Data gathering

1.1 Read the data

In [2]:
data_path = './data/NYPD_Complaint_Data_Current__Year_To_Date_.csv'

original_data = pd.read_csv(data_path, delimiter=',')

original_data.head()
Out[2]:
CMPLNT_NUM ADDR_PCT_CD BORO_NM CMPLNT_FR_DT CMPLNT_FR_TM CMPLNT_TO_DT CMPLNT_TO_TM CRM_ATPT_CPTD_CD HADEVELOPT HOUSING_PSA JURISDICTION_CODE JURIS_DESC KY_CD LAW_CAT_CD LOC_OF_OCCUR_DESC OFNS_DESC PARKS_NM PATROL_BORO PD_CD PD_DESC PREM_TYP_DESC RPT_DT STATION_NAME SUSP_AGE_GROUP SUSP_RACE SUSP_SEX TRANSIT_DISTRICT VIC_AGE_GROUP VIC_RACE VIC_SEX X_COORD_CD Y_COORD_CD Latitude Longitude Lat_Lon New Georeferenced Column
0 972326799 81 NaN 09/28/2020 21:27:00 NaN NaN COMPLETED NaN NaN NaN N.Y. POLICE DEPT 101 FELONY OUTSIDE MURDER & NON-NEGL. MANSLAUGHTER NaN NaN NaN NaN NaN 09/28/2020 NaN NaN NaN NaN NaN 25-44 BLACK M 1003904 186483 40.678516 -73.929143 (40.67851591200008, -73.92914304899993) POINT (-73.92914304899993 40.67851591200008)
1 376304873 52 NaN 09/27/2020 19:13:00 NaN NaN COMPLETED NaN NaN NaN N.Y. POLICE DEPT 101 FELONY INSIDE MURDER & NON-NEGL. MANSLAUGHTER NaN NaN NaN NaN NaN 09/27/2020 NaN 45-64 WHITE HISPANIC M NaN UNKNOWN BLACK HISPANIC M 1017928 258050 40.874906 -73.878224 (40.87490600500007, -73.87822380899996) POINT (-73.87822380899996 40.87490600500007)
2 299326203 75 NaN 09/21/2020 01:21:00 NaN NaN COMPLETED NaN NaN NaN N.Y. POLICE DEPT 101 FELONY OUTSIDE MURDER & NON-NEGL. MANSLAUGHTER NaN NaN NaN NaN NaN 09/21/2020 NaN NaN NaN NaN NaN 25-44 BLACK M 1021234 181211 40.663990 -73.866692 (40.66399002800006, -73.86669235099998) POINT (-73.86669235099998 40.66399002800006)
3 674946147 121 NaN 09/15/2020 08:46:00 NaN NaN COMPLETED NaN NaN NaN N.Y. POLICE DEPT 101 FELONY OUTSIDE MURDER & NON-NEGL. MANSLAUGHTER NaN NaN NaN NaN NaN 09/15/2020 NaN NaN NaN NaN NaN 25-44 BLACK M 938430 170972 40.635845 -74.165090 (40.63584491100005, -74.165090337) POINT (-74.165090337 40.63584491100005)
4 416422620 101 NaN 09/08/2020 13:50:00 NaN NaN COMPLETED NaN NaN NaN N.Y. POLICE DEPT 101 FELONY OUTSIDE MURDER & NON-NEGL. MANSLAUGHTER NaN NaN NaN NaN NaN 09/08/2020 NaN <18 BLACK M NaN 18-24 BLACK M 1049837 157548 40.598875 -73.763823 (40.59887464700005, -73.76382298499993) POINT (-73.76382298499993 40.59887464700005)
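Since the raw file is large (about 84 MB once loaded), pandas can restrict the load to the needed columns and parse dates at read time. A minimal sketch with a toy CSV (column names taken from the dataset, values illustrative):

```python
from io import StringIO
import pandas as pd

# Toy CSV mimicking a few columns of the NYPD complaint file (illustrative rows).
csv_text = """CMPLNT_NUM,BORO_NM,CMPLNT_FR_DT,Latitude,Longitude
972326799,BROOKLYN,09/28/2020,40.678516,-73.929143
376304873,BRONX,09/27/2020,40.874906,-73.878224
"""

# usecols limits the columns read; parse_dates converts at load time,
# avoiding a separate to_datetime pass afterwards.
df = pd.read_csv(
    StringIO(csv_text),
    usecols=["CMPLNT_NUM", "BORO_NM", "CMPLNT_FR_DT", "Latitude", "Longitude"],
    parse_dates=["CMPLNT_FR_DT"],
)

print(df["CMPLNT_FR_DT"].dtype)  # datetime64[ns]
```

Here we load everything and trim later, which keeps the exploration flexible at the cost of memory.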

2. Data understanding and preparation

Once the Challenge/Problem definition and Data gathering stages are covered, the next step is the Exploration Data Analysis and Transformation. In this stage, Visual Analytics takes a key role to explore and understand the features and their relationships between them. Thanks to this process, we will have more context to determine the next steps to be done according to the Challenge/Problem definition.

2.1 Dataset Exploratory Data Analysis (EDA)

Analyze the main characteristics (variable types, number of records, null values, etc.) of the dataset.

In [3]:
original_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306656 entries, 0 to 306655
Data columns (total 36 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   CMPLNT_NUM                306656 non-null  int64  
 1   ADDR_PCT_CD               306656 non-null  int64  
 2   BORO_NM                   306292 non-null  object 
 3   CMPLNT_FR_DT              306656 non-null  object 
 4   CMPLNT_FR_TM              306656 non-null  object 
 5   CMPLNT_TO_DT              276747 non-null  object 
 6   CMPLNT_TO_TM              276839 non-null  object 
 7   CRM_ATPT_CPTD_CD          306656 non-null  object 
 8   HADEVELOPT                15051 non-null   object 
 9   HOUSING_PSA               23239 non-null   float64
 10  JURISDICTION_CODE         306310 non-null  float64
 11  JURIS_DESC                306656 non-null  object 
 12  KY_CD                     306656 non-null  int64  
 13  LAW_CAT_CD                306656 non-null  object 
 14  LOC_OF_OCCUR_DESC         257472 non-null  object 
 15  OFNS_DESC                 306652 non-null  object 
 16  PARKS_NM                  2083 non-null    object 
 17  PATROL_BORO               306310 non-null  object 
 18  PD_CD                     306310 non-null  float64
 19  PD_DESC                   306310 non-null  object 
 20  PREM_TYP_DESC             305771 non-null  object 
 21  RPT_DT                    306656 non-null  object 
 22  STATION_NAME              5469 non-null    object 
 23  SUSP_AGE_GROUP            234988 non-null  object 
 24  SUSP_RACE                 234988 non-null  object 
 25  SUSP_SEX                  234988 non-null  object 
 26  TRANSIT_DISTRICT          5469 non-null    float64
 27  VIC_AGE_GROUP             306655 non-null  object 
 28  VIC_RACE                  306655 non-null  object 
 29  VIC_SEX                   306655 non-null  object 
 30  X_COORD_CD                306656 non-null  int64  
 31  Y_COORD_CD                306656 non-null  int64  
 32  Latitude                  306656 non-null  float64
 33  Longitude                 306656 non-null  float64
 34  Lat_Lon                   306656 non-null  object 
 35  New Georeferenced Column  306656 non-null  object 
dtypes: float64(6), int64(5), object(25)
memory usage: 84.2+ MB

Removing not useful variables

Looking at the definitions of the variables, we decided to remove some of them that are not useful for what we need. They are:

  • CMPLNT_TO_DT and CMPLNT_TO_TM: they indicate when the reported event ended, which is not needed for this project.
  • HADEVELOPT and HOUSING_PSA: name of the NYCHA housing development of occurrence and development level code, not of interest for this project.
  • LOC_OF_OCCUR_DESC: specific location of occurrence in or around the premises, not needed for this project.
  • KY_CD, PD_CD, JURISDICTION_CODE: since we have the descriptions, we don't need the codes.
  • RPT_DT: the date the event was reported; we will work with the occurrence date (CMPLNT_FR_DT) instead.
  • X_COORD_CD, Y_COORD_CD, Lat_Lon and New Georeferenced Column: we will use only the Latitude and Longitude attributes.
In [4]:
#We decided to remove those variables since they are not useful
cols_to_drop = ['CMPLNT_TO_DT', 'CMPLNT_TO_TM', 'HADEVELOPT', 'HOUSING_PSA', 'KY_CD',
                'LOC_OF_OCCUR_DESC', 'PD_CD', 'RPT_DT', 'X_COORD_CD', 'Y_COORD_CD',
                'Lat_Lon', 'New Georeferenced Column', 'JURISDICTION_CODE']
data = original_data.drop(cols_to_drop, axis=1)

Our dataset has 306656 records. It originally consisted of 36 columns, but after deleting the 13 non-useful ones we end up with 23 columns.

In [5]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306656 entries, 0 to 306655
Data columns (total 23 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   CMPLNT_NUM        306656 non-null  int64  
 1   ADDR_PCT_CD       306656 non-null  int64  
 2   BORO_NM           306292 non-null  object 
 3   CMPLNT_FR_DT      306656 non-null  object 
 4   CMPLNT_FR_TM      306656 non-null  object 
 5   CRM_ATPT_CPTD_CD  306656 non-null  object 
 6   JURIS_DESC        306656 non-null  object 
 7   LAW_CAT_CD        306656 non-null  object 
 8   OFNS_DESC         306652 non-null  object 
 9   PARKS_NM          2083 non-null    object 
 10  PATROL_BORO       306310 non-null  object 
 11  PD_DESC           306310 non-null  object 
 12  PREM_TYP_DESC     305771 non-null  object 
 13  STATION_NAME      5469 non-null    object 
 14  SUSP_AGE_GROUP    234988 non-null  object 
 15  SUSP_RACE         234988 non-null  object 
 16  SUSP_SEX          234988 non-null  object 
 17  TRANSIT_DISTRICT  5469 non-null    float64
 18  VIC_AGE_GROUP     306655 non-null  object 
 19  VIC_RACE          306655 non-null  object 
 20  VIC_SEX           306655 non-null  object 
 21  Latitude          306656 non-null  float64
 22  Longitude         306656 non-null  float64
dtypes: float64(3), int64(2), object(18)
memory usage: 53.8+ MB

Managing null values

In [6]:
data.isnull().sum()
Out[6]:
CMPLNT_NUM               0
ADDR_PCT_CD              0
BORO_NM                364
CMPLNT_FR_DT             0
CMPLNT_FR_TM             0
CRM_ATPT_CPTD_CD         0
JURIS_DESC               0
LAW_CAT_CD               0
OFNS_DESC                4
PARKS_NM            304573
PATROL_BORO            346
PD_DESC                346
PREM_TYP_DESC          885
STATION_NAME        301187
SUSP_AGE_GROUP       71668
SUSP_RACE            71668
SUSP_SEX             71668
TRANSIT_DISTRICT    301187
VIC_AGE_GROUP            1
VIC_RACE                 1
VIC_SEX                  1
Latitude                 0
Longitude                0
dtype: int64
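Raw null counts are easier to judge as percentages of the total records; a minimal sketch of the computation on toy data (illustrative values, not the real dataset):

```python
import pandas as pd
import numpy as np

# Toy frame with nulls of very different magnitudes (illustrative only).
df = pd.DataFrame({
    "PARKS_NM": [np.nan, np.nan, np.nan, "Central Park"],
    "OFNS_DESC": ["ROBBERY", np.nan, "HARRASSMENT 2", "PETIT LARCENY"],
})

# Share of nulls per column, as a percentage:
# the mean of the boolean null mask is the null fraction.
null_pct = df.isnull().mean().mul(100).round(2)
print(null_pct)
```

On the real data, `data.isnull().mean().mul(100)` would show directly that PARKS_NM, STATION_NAME and TRANSIT_DISTRICT are null in the overwhelming majority of records.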

As we said before, the dataset now has 306656 records and 23 variables. Some variables contain null values, as can be seen above. The variables with more than 100,000 null values are PARKS_NM, STATION_NAME and TRANSIT_DISTRICT.

  • PARKS_NM: Name of the NYC park, playground or greenspace of occurrence, if applicable (state parks are not included). It has a value only if the offense occurred in a park, which is why there are so many nulls.
  • STATION_NAME: Transit station name, filled only if the offense occurred on public transport, which is why there are so many nulls.
  • TRANSIT_DISTRICT: Transit district in which the offense occurred, filled only if the offense occurred on public transport, which is why there are so many nulls.

Other variables with null values are:

  • BORO_NM: 364. Since we are going to compute the borough for all complaints using the latitude and longitude variables, it does not matter that some of them are currently unknown.
  • PREM_TYP_DESC: 885. It specifies the description of the premises (grocery store, residence, street, etc.). We will drop the entries with null values.
  • SUSP_AGE_GROUP, SUSP_RACE, SUSP_SEX: 71668. We will decide what to do with these entries later on.
  • VIC_AGE_GROUP, VIC_RACE, VIC_SEX: 1. We will drop this entry.
  • PATROL_BORO and PD_DESC: 346. The entries with null values in PATROL_BORO are the same 346 records that have null values in PD_DESC. We are going to drop the records with a null PD_DESC since they would not allow us to classify the complaints.
  • OFNS_DESC: 4. We will drop the entries without an offense description.
In [7]:
#Dropping the entries with null values in the variables previously explained:
data= data[(data['PREM_TYP_DESC'].notnull())]
data= data[(data['VIC_AGE_GROUP'].notnull())]
data= data[(data['PD_DESC'].notnull())]
data= data[(data['OFNS_DESC'].notnull())]
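The four chained `.notnull()` filters above are equivalent to a single `dropna` over the relevant columns; a minimal sketch with toy data:

```python
import pandas as pd

# Toy frame with nulls in different columns (illustrative values).
df = pd.DataFrame({
    "PREM_TYP_DESC": ["STREET", None, "RESIDENCE"],
    "VIC_AGE_GROUP": ["25-44", "18-24", None],
    "OFNS_DESC": ["ROBBERY", "ASSAULT", "HARASSMENT"],
})

# One pass: drop any row that is null in any of the listed columns,
# equivalent to chaining per-column .notnull() filters.
cleaned = df.dropna(subset=["PREM_TYP_DESC", "VIC_AGE_GROUP"])
print(len(cleaned))  # 1
```

`dropna(subset=[...])` scales better as the list of required columns grows, and states the intent in one line.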
In [8]:
print("Number of registers: ", len(data))
data.isnull().sum()
Number of registers:  305766
Out[8]:
CMPLNT_NUM               0
ADDR_PCT_CD              0
BORO_NM                 18
CMPLNT_FR_DT             0
CMPLNT_FR_TM             0
CRM_ATPT_CPTD_CD         0
JURIS_DESC               0
LAW_CAT_CD               0
OFNS_DESC                0
PARKS_NM            303685
PATROL_BORO              0
PD_DESC                  0
PREM_TYP_DESC            0
STATION_NAME        300297
SUSP_AGE_GROUP       71376
SUSP_RACE            71376
SUSP_SEX             71376
TRANSIT_DISTRICT    300297
VIC_AGE_GROUP            0
VIC_RACE                 0
VIC_SEX                  0
Latitude                 0
Longitude                0
dtype: int64

Now we have 305766 entries, and the only remaining null values are in variables where it makes sense to keep them. For the categorical variables whose values we do not know, we will replace NULL with "UNKNOWN". For parks and transit stations we keep NULL: if a crime did not occur in a park, the park is not unknown, there simply is no park (NULL).

In [9]:
data["BORO_NM"]=data["BORO_NM"].astype(object).fillna("UNKNOWN")
data["SUSP_AGE_GROUP"]=data["SUSP_AGE_GROUP"].astype(object).fillna("UNKNOWN")
data["SUSP_RACE"]=data["SUSP_RACE"].astype(object).fillna("UNKNOWN")
data["SUSP_SEX"]=data["SUSP_SEX"].astype(object).fillna("UNKNOWN")
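The fillna pattern above, shown on a toy column (illustrative values):

```python
import pandas as pd
import numpy as np

# Toy borough column with a missing label.
boro = pd.Series(["BROOKLYN", np.nan, "QUEENS"])

# Replace missing labels with an explicit "UNKNOWN" category,
# so "missing" becomes a value we can count and plot.
filled = boro.fillna("UNKNOWN")
print(filled.isnull().sum())  # 0
```

After this step, "UNKNOWN" behaves like any other category in countplots and groupbys.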

Correct datatypes

We are going to look at each variable and see if they have the correct data type. If not, we will change them.

In [10]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 305766 entries, 12 to 306655
Data columns (total 23 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   CMPLNT_NUM        305766 non-null  int64  
 1   ADDR_PCT_CD       305766 non-null  int64  
 2   BORO_NM           305766 non-null  object 
 3   CMPLNT_FR_DT      305766 non-null  object 
 4   CMPLNT_FR_TM      305766 non-null  object 
 5   CRM_ATPT_CPTD_CD  305766 non-null  object 
 6   JURIS_DESC        305766 non-null  object 
 7   LAW_CAT_CD        305766 non-null  object 
 8   OFNS_DESC         305766 non-null  object 
 9   PARKS_NM          2081 non-null    object 
 10  PATROL_BORO       305766 non-null  object 
 11  PD_DESC           305766 non-null  object 
 12  PREM_TYP_DESC     305766 non-null  object 
 13  STATION_NAME      5469 non-null    object 
 14  SUSP_AGE_GROUP    305766 non-null  object 
 15  SUSP_RACE         305766 non-null  object 
 16  SUSP_SEX          305766 non-null  object 
 17  TRANSIT_DISTRICT  5469 non-null    float64
 18  VIC_AGE_GROUP     305766 non-null  object 
 19  VIC_RACE          305766 non-null  object 
 20  VIC_SEX           305766 non-null  object 
 21  Latitude          305766 non-null  float64
 22  Longitude         305766 non-null  float64
dtypes: float64(3), int64(2), object(18)
memory usage: 56.0+ MB
  • CMPLNT_NUM and ADDR_PCT_CD are correctly integers.
  • BORO_NM, CRM_ATPT_CPTD_CD, JURIS_DESC, LAW_CAT_CD, OFNS_DESC, PARKS_NM, PATROL_BORO, PD_DESC, PREM_TYP_DESC, STATION_NAME, SUSP_AGE_GROUP, SUSP_RACE, SUSP_SEX, VIC_AGE_GROUP, VIC_RACE and VIC_SEX are objects and should be categorical.
  • CMPLNT_FR_DT and CMPLNT_FR_TM are objects and should be datetime.
  • TRANSIT_DISTRICT, Latitude and Longitude are correctly floats.

Converting datetimes variables to the correct format and type

In [11]:
#Converting dates
data["CMPLNT_FR_DT"]=pd.to_datetime(data['CMPLNT_FR_DT'], format="%m/%d/%Y", errors = 'coerce')
#data["CMPLNT_FR_DT"]
In [12]:
#We keep complaints from after 2019-09-30, since the last record is from 2020-09-30. This way we have one full year.
data= data[(data['CMPLNT_FR_DT'] > '2019-09-30')]
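After filtering, it is worth sanity-checking the resulting date window; a minimal sketch with toy dates (illustrative values):

```python
import pandas as pd

# Toy complaint dates straddling the cut-off.
dates = pd.to_datetime(["09/15/2019", "10/01/2019", "09/30/2020"],
                       format="%m/%d/%Y")
df = pd.DataFrame({"CMPLNT_FR_DT": dates})

# Keep strictly after 2019-09-30, then verify the resulting window.
window = df[df["CMPLNT_FR_DT"] > "2019-09-30"]
print(window["CMPLNT_FR_DT"].min(), window["CMPLNT_FR_DT"].max())
```

On the real data, `data['CMPLNT_FR_DT'].min()` and `.max()` confirm the one-year span 2019-10-01 to 2020-09-30.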
In [13]:
#Converting hours
data['CMPLNT_FR_TM']= pd.to_datetime(data['CMPLNT_FR_TM'], errors = 'coerce')
In [14]:
#Keep only those registers with ages in a valid group, since there are some wrong entries.
valid_ages = ["<18", "18-24", "25-44", "45-64", "65+", "UNKNOWN"]
data = data[data['VIC_AGE_GROUP'].isin(valid_ages)]
data = data[data['SUSP_AGE_GROUP'].isin(valid_ages)]

Converting categorical variables

In [15]:
# Define the numerical variables
num_variables = [column for column, datatype in data.dtypes.items() if datatype in (np.int64, np.float64)]
date_variables=['CMPLNT_FR_DT', 'CMPLNT_FR_TM']
# Define the categorical ones
categorical_variables = [column for column in data.columns if column not in num_variables+date_variables]
print(categorical_variables)
['BORO_NM', 'CRM_ATPT_CPTD_CD', 'JURIS_DESC', 'LAW_CAT_CD', 'OFNS_DESC', 'PARKS_NM', 'PATROL_BORO', 'PD_DESC', 'PREM_TYP_DESC', 'STATION_NAME', 'SUSP_AGE_GROUP', 'SUSP_RACE', 'SUSP_SEX', 'VIC_AGE_GROUP', 'VIC_RACE', 'VIC_SEX']
In [16]:
#Printing the number of unique values of the categorical variables, and transforming them to their corresponding type. 
for variable in categorical_variables:
    print(f'Unique values in {variable} : {len(data[variable].unique())}')
    data[variable] = data[variable].astype('category')
Unique values in BORO_NM : 6
Unique values in CRM_ATPT_CPTD_CD : 2
Unique values in JURIS_DESC : 18
Unique values in LAW_CAT_CD : 3
Unique values in OFNS_DESC : 57
Unique values in PARKS_NM : 432
Unique values in PATROL_BORO : 8
Unique values in PD_DESC : 323
Unique values in PREM_TYP_DESC : 74
Unique values in STATION_NAME : 359
Unique values in SUSP_AGE_GROUP : 6
Unique values in SUSP_RACE : 7
Unique values in SUSP_SEX : 4
Unique values in VIC_AGE_GROUP : 6
Unique values in VIC_RACE : 7
Unique values in VIC_SEX : 4
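The `astype('category')` conversion above also pays off in memory for low-cardinality columns like BORO_NM; a small self-contained demonstration:

```python
import pandas as pd

# A long, repetitive string column (illustrative of BORO_NM-style data).
s_obj = pd.Series(["BROOKLYN", "MANHATTAN", "QUEENS"] * 10000)
s_cat = s_obj.astype("category")

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)

# With only 3 distinct labels, the integer category codes are far
# smaller than storing every string separately.
print(obj_bytes > cat_bytes)  # True
```

This is why the dataframe's memory footprint drops noticeably once the 16 categorical variables are converted.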

Plot distributions

We plot the distribution of complaints by weekday to see whether there is any interesting pattern, but we observe that the number of complaints is very similar regardless of the weekday.

In [17]:
plt.title("Number of complaints classified by day of the week")
data['CMPLNT_NUM'].groupby(data["CMPLNT_FR_DT"].dt.weekday).count().plot(kind="bar")
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fcf4515e5b0>

Here we group and plot the complaints by hour. We observe that the hours with the fewest reported complaints are from 2 to 7 am, which makes sense since most people are asleep at those hours. However, crimes are also reported around midnight.

In [18]:
plt.title("Number of complaints distributed by hours")
data['CMPLNT_FR_TM'].groupby(data["CMPLNT_FR_TM"].dt.hour).count().plot(kind="bar")
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fcf45056a30>
In [19]:
#Defining a function that will be used several times to plot a countplot of a determined variable in a dataframe
def plot_countplot(data, x):
    ax = sns.countplot(x=x, data=data)

    total= data[x].value_counts().sum()
    plt.title("Countplot of "+x)
    for p in ax.patches:
        txt = str((p.get_height()/total*100).round(2)) + '%'
        txt_x = p.get_x() 
        txt_y = p.get_height()
        ax.text(txt_x,txt_y,txt)
    plt.xticks(rotation=90)
    plt.show()

Plotting countplots of the categorical variables:

In [20]:
for variable in categorical_variables:
    if len(data[variable].unique()) < 20:
        plot_countplot(data, variable)

With these plots, we can extract some general information about crime in New York:

  • Brooklyn is the borough with the most complaints, followed by Manhattan.
  • Most of the crimes were completed.
  • The jurisdiction responsible for the incident is, in most cases, the N.Y. Police Department, since it covers the majority of cases.
  • Half of the crimes committed are misdemeanors, followed by felonies and violations.
  • Most suspects' ages are unknown, meaning they could not be identified. Among the identified ones, people between 25 and 44 years old committed the most crimes. The same happens with the race of the suspects: among the identified ones, a large majority are black, followed by white hispanic. Most of the suspects are men.
  • Regarding the victims, the large majority are between 25 and 44 years old. The race of some victims is also unknown, but among the known ones black victims represent the majority, followed by white hispanic and white. As for sex, the difference is not as large as for suspects: victims are almost equally distributed between men and women.
In [21]:
#Defining a function that will be used to plot a variable distinguishing the values of another one.
def plot_by_variable(data, var1, var_hue):
    ax = sns.catplot(x=var1, hue=var_hue, kind="count",palette="cubehelix", data=data, height=4, aspect=2)
    ax.set_xticklabels(rotation=30).set_titles("{col_name} {col_var}") 
    return ax

Now we show some plots that will be explained in more detail later in the visualizations, but that are useful now to see what we can expect.

In [22]:
ax=plot_by_variable(data, "VIC_RACE", "VIC_SEX")
ax.set(title='Plot of victims race classified by their sex')
ax=plot_by_variable(data, "VIC_AGE_GROUP", "VIC_SEX")
ax.set(title='Plot of victims age group classified by their sex')
ax=plot_by_variable(data, "BORO_NM", "VIC_SEX")
ax.set(title='Plot of crimes done in boroughs classified by the sex of the victims')
ax=plot_by_variable(data, "BORO_NM", "SUSP_SEX")
ax.set(title='Plot of crimes done in boroughs classified by the sex of the suspects')
ax=plot_by_variable(data, "BORO_NM", "SUSP_RACE")
ax.set(title='Plot of crimes done in boroughs classified by the race of the suspects')
Out[22]:
<seaborn.axisgrid.FacetGrid at 0x7fcf44e3b7f0>
In [23]:
"""10 parks with the most crimes in NYC"""

data['PARKS_NM'].value_counts()[:10].plot.bar()
plt.ylabel('Count')
plt.title('10 most dangerous parks in NYC')
plt.show()
 
In [24]:
"""10 most common crimes in NYC"""

data['OFNS_DESC'].value_counts()[:10].plot.bar()
plt.ylabel('Count')
plt.title('10 most common crimes in NYC')
plt.show()
In [25]:
"""10 stations with the most crimes in NYC"""

data['STATION_NAME'].value_counts()[:10].plot.bar()
plt.ylabel('Count')
plt.title('10 stations with the most crimes in NYC')
plt.show()

Let's look at the OFNS_DESC value "DANGEROUS DRUGS" to see which are the most common drug-related offenses.

In [26]:
data[data["OFNS_DESC"]=="DANGEROUS DRUGS"].head()
total=len(data[data["OFNS_DESC"]=="DANGEROUS DRUGS"]["PD_DESC"])
for desc in data[data["OFNS_DESC"]=="DANGEROUS DRUGS"]["PD_DESC"].unique():
    summ=(data[data["OFNS_DESC"]=="DANGEROUS DRUGS"]["PD_DESC"]==desc).sum()
    percent =round(summ/total*100,2)
    print("For ", desc, ":", percent, "%")
#We see that controlled substance possession is the most common drug offense.
For  CONTROLLED SUBSTANCE,INTENT TO : 12.43 %
For  CONTROLLED SUBSTANCE, POSSESSI : 45.31 %
For  MARIJUANA, SALE 4 & 5 : 7.53 %
For  MARIJUANA, POSSESSION 1, 2 & 3 : 3.06 %
For  CONTROLLED SUBSTANCE,SALE 3 : 5.06 %
For  CONTROLLED SUBSTANCE, SALE 5 : 6.59 %
For  CONTROLLED SUBSTANCE,POSSESS. : 9.68 %
For  CONTROLLED SUBSTANCE,SALE 1 : 2.49 %
For  CONTROLLED SUBSTANCE, SALE 4 : 0.59 %
For  SALE SCHOOL GROUNDS 4 : 0.71 %
For  MARIJUANA, SALE 1, 2 & 3 : 0.44 %
For  MARIJUANA, POSSESSION 4 & 5 : 2.45 %
For  CONTROLLED SUBSTANCE, INTENT T : 1.5 %
For  CONTROLLED SUBSTANCE,SALE 2 : 0.84 %
For  DRUG PARAPHERNALIA,   POSSESSE : 0.97 %
For  CONTROLLED SUBSTANCE,POSSESS.- : 0.18 %
For  POSSESSION HYPODERMIC INSTRUME : 0.15 %
For  POSS METH MANUFACT MATERIAL : 0.02 %
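The same percentages can be computed in a single pass with `value_counts(normalize=True)` instead of looping over unique values; a minimal sketch with toy descriptions:

```python
import pandas as pd

# Toy PD_DESC values for drug offenses (illustrative counts).
pd_desc = pd.Series(["CONTROLLED SUBSTANCE, POSSESSI",
                     "CONTROLLED SUBSTANCE, POSSESSI",
                     "MARIJUANA, SALE 4 & 5",
                     "CONTROLLED SUBSTANCE,SALE 3"])

# value_counts(normalize=True) yields each description's share in one
# pass, sorted descending, replacing the explicit loop over uniques.
pct = pd_desc.value_counts(normalize=True).mul(100).round(2)
print(pct.iloc[0])  # 50.0
```

Besides being shorter, this returns a sorted Series, so the most common offense is always first.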

In the following plot we see the days with the most reported offenses. The two most dangerous days are the first of June and the first day of the year. This makes sense, since on New Year's Eve many altercations tend to happen. Later on we will see a map of the distribution of crimes on that specific day, showing how crimes evolve through the hours.

In [27]:
bydays=data['CMPLNT_NUM'].groupby(data["CMPLNT_FR_DT"]).count()
bydays
bydays.sort_values(ascending=False)[:10].plot(kind="bar")
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fcf40753520>
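The groupby-count-sort pattern used above can be illustrated on toy data (illustrative dates, not the real ones):

```python
import pandas as pd

# Toy complaints: three on June 1st, two on Jan 1st, one elsewhere.
dates = pd.to_datetime(["2020-01-01", "2020-01-01", "2020-06-01",
                        "2020-06-01", "2020-06-01", "2020-03-15"])
df = pd.DataFrame({"CMPLNT_NUM": range(6), "CMPLNT_FR_DT": dates})

# Complaints per day, sorted descending: the busiest days come first.
per_day = df["CMPLNT_NUM"].groupby(df["CMPLNT_FR_DT"]).count()
top = per_day.sort_values(ascending=False)
print(top.index[0].date())  # 2020-06-01
```

Slicing `top[:10]` and plotting, as the notebook does, then shows the ten busiest days.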

Some data analysis

We want to see if there is any type of crime that is more frequent in the early-morning hours than during the day, or if they follow the same distributions.

In [28]:
#We split the dataset into two dataframes: one for the night (2 to 7 am) and another for the day (the rest of the hours)
early_morning_df= data[(data["CMPLNT_FR_TM"].dt.hour >= 2)& (data["CMPLNT_FR_TM"].dt.hour <=7)]
during_day_df=data[(data["CMPLNT_FR_TM"].dt.hour < 2)| (data["CMPLNT_FR_TM"].dt.hour >7)]
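The split above can be checked on toy timestamps to confirm the two frames partition the data (a minimal sketch, column name as in the notebook):

```python
import pandas as pd

# Toy complaint times: two in the 02:00-07:59 window, two outside it.
times = pd.to_datetime(["2020-01-01 03:00", "2020-01-01 07:59",
                        "2020-01-01 01:30", "2020-01-01 12:00"])
df = pd.DataFrame({"CMPLNT_FR_TM": times})

hour = df["CMPLNT_FR_TM"].dt.hour
night = df[(hour >= 2) & (hour <= 7)]   # 02:00 through 07:59
day = df[(hour < 2) | (hour > 7)]       # everything else

print(len(night), len(day))  # 2 2
```

The two conditions are complements, so every record lands in exactly one of the frames.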

In the following plot, we will show the most common offenses at night compared to the most common offenses during the day.

In [29]:
fig,axes=plt.subplots(1,2, figsize=(15, 5))
ofenses_night=early_morning_df['OFNS_DESC'].groupby(early_morning_df["OFNS_DESC"]).count()
ofenses_night.sort_values(ascending=False)[:10].plot(kind="bar",ax=axes[0])
axes[0].title.set_text('Most common offenses at night')

ofenses_day=during_day_df['OFNS_DESC'].groupby(during_day_df["OFNS_DESC"]).count()
ofenses_day.sort_values(ascending=False)[:10].plot(kind="bar",ax=axes[1])
axes[1].title.set_text('Most common offenses during the day')
plt.show()

We can see that the most common offense is petit larceny, both at night and during the day. However, at night harassment is slightly less frequent and robberies are more common. So the types of offenses that occur during the day and at night are very similar, with only small differences.

In the next plot we will see again the difference between day and night, but this time using the variable PD_DESC, which gives a more detailed description of the crime.

In [30]:
fig,axes=plt.subplots(1,2, figsize=(15, 5))

ofenses_night=early_morning_df['PD_DESC'].groupby(early_morning_df["PD_DESC"]).count()
ofenses_night.sort_values(ascending=False)[:10].plot(kind="bar", ax=axes[0])
axes[0].title.set_text('Most common offenses at night (more descriptive)')

ofenses_day=during_day_df['PD_DESC'].groupby(during_day_df["PD_DESC"]).count()
ofenses_day.sort_values(ascending=False)[:10].plot(kind="bar", ax=axes[1])
axes[1].title.set_text('Most common offenses during the day (more descriptive)')
plt.show()

Again, we see that most offenses are similarly distributed between day and night. Even so, we can notice that larcenies from stores are more common during the day, while criminal mischief is more common at night.

Now we will create a dataset containing only the crimes classified as violations. In this first plot, we see the most common places where harassment takes place. We notice that it is more frequent in houses than on the street, meaning that a large share of these offenses are probably committed by relatives or roommates.

In [31]:
VIOLATIONS=data[data["LAW_CAT_CD"]=="VIOLATION"]

VIOLATIONS['PREM_TYP_DESC'].value_counts()[:10].plot.bar()
plt.ylabel('Count')
plt.title('Most common places of harassment')
plt.show()

The next plot shows the distribution of sexes for victims and suspects of violation crimes. Comparing them with the general plots, we see that in violations most of the victims are female (63% female vs 35% male), while for crimes in general the distribution was almost even (40% female and 37% male). The opposite occurs with the suspects: in the general distribution, 45% of crimes were committed by males, whereas in violation crimes they are responsible for 56% of the offenses. So from this information we can infer that violations tend to be committed by men against women.

In [32]:
#Compare with the general ones
plot_countplot(VIOLATIONS, 'VIC_SEX')
plot_countplot(VIOLATIONS, 'SUSP_SEX')
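The percentages discussed above can also be obtained numerically rather than read off the bar labels; a minimal sketch with toy data (illustrative values):

```python
import pandas as pd

# Toy victim-sex column for a subset of crimes.
vic_sex = pd.Series(["F", "F", "F", "M", "M", "D"])

# Normalized counts give the share of each category directly,
# which is what the countplot annotations display.
shares = vic_sex.value_counts(normalize=True).round(2)
print(shares["F"])  # 0.5
```

Running the same one-liner on `VIOLATIONS['VIC_SEX']` and on `data['VIC_SEX']` makes the day-to-day comparison of shares exact rather than visual.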

CLUSTERING

We will perform clustering on the profiles of victims and suspects together with the level of offense, so we create a dataframe with the needed variables:

In [33]:
data_clustering=data[["LAW_CAT_CD", "SUSP_AGE_GROUP", "SUSP_RACE", "SUSP_SEX", "VIC_AGE_GROUP", "VIC_RACE", "VIC_SEX"]]
In [34]:
#We divide the data_clustering in two dataframes: one with the unknown values and the other one with the known ones, since we 
#will perform the clustering over the profiles that have been properly identified. Later on, we will try to predict some 
#characteristics of suspects that have not been identified.
data_cluster_topredict=data_clustering[(data_clustering["VIC_RACE"]=="UNKNOWN") | (data_clustering["VIC_AGE_GROUP"]=="UNKNOWN") | (data_clustering["VIC_SEX"]=="UNKNOWN")
                                | (data_clustering["SUSP_RACE"]=="UNKNOWN") | (data_clustering["SUSP_AGE_GROUP"]=="UNKNOWN") | (data_clustering["SUSP_SEX"]=="UNKNOWN")
                                 | (data_clustering["SUSP_SEX"]=="U")]
data_cluster = data_clustering.drop(index=data_cluster_topredict.index)
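The chained `|` conditions above can be written more compactly with `.eq(...).any(axis=1)`; a sketch on toy data (a subset of the notebook's columns, for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "SUSP_RACE": ["BLACK", "UNKNOWN", "WHITE"],
    "SUSP_SEX":  ["M", "F", "U"],
    "VIC_RACE":  ["WHITE", "BLACK", "BLACK"],
})

# Flag rows where any profile column is UNKNOWN, or the suspect sex is 'U'.
unknown = df[["SUSP_RACE", "VIC_RACE"]].eq("UNKNOWN").any(axis=1) | df["SUSP_SEX"].eq("U")
to_predict = df[unknown]
known = df.drop(index=to_predict.index)
```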
In [35]:
data_cluster.head()
Out[35]:
LAW_CAT_CD SUSP_AGE_GROUP SUSP_RACE SUSP_SEX VIC_AGE_GROUP VIC_RACE VIC_SEX
54 FELONY 25-44 BLACK HISPANIC M 45-64 WHITE HISPANIC F
60 VIOLATION 45-64 WHITE F 45-64 BLACK F
66 MISDEMEANOR 25-44 BLACK M 25-44 WHITE HISPANIC F
71 MISDEMEANOR 45-64 BLACK M 18-24 ASIAN / PACIFIC ISLANDER M
81 FELONY 25-44 WHITE HISPANIC M 25-44 WHITE HISPANIC F

First of all, we are going to encode the different values to numbers, so that we can perform the clustering.

In [36]:
!pip install category_encoders
In [37]:
import category_encoders as ce

# create object of Ordinalencoding
encoder= ce.OrdinalEncoder(cols=["LAW_CAT_CD", "SUSP_AGE_GROUP", "SUSP_RACE", "SUSP_SEX", 'VIC_SEX', 'VIC_AGE_GROUP', 'VIC_RACE'],return_df=True,
                           mapping=[{'col': "LAW_CAT_CD", 'mapping':{"FELONY": 1, "MISDEMEANOR": 0}},#VIOLATION IS -1
                                    {'col':'SUSP_SEX', 'mapping':{'M':-1,'F':1, 'E': 0, 'D': 0}}, 
                                    {'col': "SUSP_AGE_GROUP", 'mapping':{'<18': 14, '18-24': 22, '25-44': 35, '45-64': 55, '65+': 75}},
                                    {'col': "SUSP_RACE", 'mapping': {"AMERICAN INDIAN/ALASKAN NATIVE": -10, "ASIAN / PACIFIC ISLANDER": -6, "WHITE": -2,
                                                                    "WHITE HISPANIC": 2, "BLACK HISPANIC": 6, "BLACK": 10}},
                                    {'col':'VIC_SEX', 'mapping':{'M':-1,'F':1, 'E': 0, 'D': 0}}, 
                                    {'col': "VIC_AGE_GROUP", 'mapping':{'<18': 14, '18-24': 22, '25-44': 35, '45-64': 55, '65+': 75}},
                                    {'col': "VIC_RACE", 'mapping': {"AMERICAN INDIAN/ALASKAN NATIVE": -10, "ASIAN / PACIFIC ISLANDER": -6, "WHITE": -2,
                                                                    "WHITE HISPANIC": 2, "BLACK HISPANIC": 6, "BLACK": 10}},                                   
                                     ])
#Transformed data
df_train_transformed = encoder.fit_transform(data_cluster)
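Note that these numeric codes impose an ordering and distances that K-Means will take at face value (e.g. age groups mapped to rough midpoints). The encoding itself is equivalent to a per-column `Series.map`; a pandas-only sketch, in case `category_encoders` is unavailable:

```python
import pandas as pd

# Same age-group mapping as in the encoder above, applied with plain pandas.
age_map = {'<18': 14, '18-24': 22, '25-44': 35, '45-64': 55, '65+': 75}
s = pd.Series(['<18', '45-64', '25-44'])
encoded = s.map(age_map)
```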
In [38]:
#We will plot the distributions of the variables now that the dataframe only contains identified people, and with the 
#variables already encoded. This step is just to check everything works as expected.
for var in df_train_transformed:
    plot_countplot(df_train_transformed, var)
In [39]:
#Choosing optimal number of clusters based on elbow method
sse={}
for num_clusters in list(range(1,8)):
    kmeans = KMeans(n_clusters=num_clusters, random_state=333)
    kmeans.fit(df_train_transformed)
    sse[num_clusters]=kmeans.inertia_

plt.title("Elbow criterion method chart")
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()
print("With the elbow method we see that the optimal number of clusters for this dataset is 3.")
With the elbow method we see that the optimal number of clusters for this dataset is 3.
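The elbow computation can also be reproduced without scikit-learn; a compact Lloyd's-iteration sketch on synthetic blobs (illustrative only, not the notebook's KMeans):

```python
import numpy as np

def kmeans_sse(X, k, n_iter=25, seed=0):
    """Return the K-Means inertia (sum of squared distances) for k clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each point to its nearest center, then move centers to the means.
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    return float(((X - centers[labels]) ** 2).sum())

# Two well-separated blobs: SSE drops sharply from k=1 to k=2, then flattens.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
sse = {k: kmeans_sse(X, k) for k in range(1, 5)}
```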
In [40]:
#After identifying the optimal number of clusters, we will classify the complaints in three clusters, using the variables 
#of suspects and victims, plus the type of crime.
kmeans=KMeans(n_clusters=3)
kmeans.fit(df_train_transformed)
clustering=kmeans.labels_

clustered_data = pd.DataFrame(df_train_transformed)
#we will add a column in the datasets corresponding to the computed cluster
clustered_data["Cluster"]=clustering
data_cluster["Cluster"]=clustering

Now we export the dataframe with the computed clusters to create some visualizations with Tableau that will provide useful information.

In [41]:
#aran_outcomes = '/content/drive/MyDrive/Visual Analytics/VISUAL ANALYTICS NURIA & ARAN/data_clustered.csv'
#nuria_outcomes = "/content/drive/My Drive/VISUAL ANALYTICS NURIA & ARAN/data_clustered.csv"
data_cluster.to_csv('data_clustered.csv', sep=';')
In [42]:
data_cluster.head()
Out[42]:
LAW_CAT_CD SUSP_AGE_GROUP SUSP_RACE SUSP_SEX VIC_AGE_GROUP VIC_RACE VIC_SEX Cluster
54 FELONY 25-44 BLACK HISPANIC M 45-64 WHITE HISPANIC F 0
60 VIOLATION 45-64 WHITE F 45-64 BLACK F 2
66 MISDEMEANOR 25-44 BLACK M 25-44 WHITE HISPANIC F 1
71 MISDEMEANOR 45-64 BLACK M 18-24 ASIAN / PACIFIC ISLANDER M 2
81 FELONY 25-44 WHITE HISPANIC M 25-44 WHITE HISPANIC F 1

The three following cells show the description of the three clusters, with the variables encoded as numbers. This is not very useful on its own, since with Tableau we will extract more interesting visualizations, but as a small hint: in cluster 1 the suspects are around 31 years old on average and the victims around 30. In contrast, cluster 2 contains suspects aged between 55 and 75, with victims of mixed ages. Finally, cluster 0 contains suspects mostly between 14 and 55 years old, with victims from 55 to 75. So the clustering has roughly distributed the offenses into three kinds: crimes from adults against adults, crimes committed by elderly people, and crimes from younger people against elderly ones.

In [43]:
clustered_data[clustered_data["Cluster"]==0].describe()
Out[43]:
LAW_CAT_CD SUSP_AGE_GROUP SUSP_RACE SUSP_SEX VIC_AGE_GROUP VIC_RACE VIC_SEX Cluster
count 17699.000000 17699.000000 17699.000000 17699.000000 17699.000000 17699.000000 17699.000000 17699.0
mean -0.024804 32.799085 5.073846 -0.465959 59.334708 3.546528 0.011187 0.0
std 0.792528 8.880440 5.438416 0.884832 8.240650 5.764919 0.999881 0.0
min -1.000000 14.000000 -10.000000 -1.000000 55.000000 -10.000000 -1.000000 0.0
25% -1.000000 35.000000 2.000000 -1.000000 55.000000 -2.000000 -1.000000 0.0
50% 0.000000 35.000000 6.000000 -1.000000 55.000000 2.000000 1.000000 0.0
75% 1.000000 35.000000 10.000000 1.000000 55.000000 10.000000 1.000000 0.0
max 1.000000 55.000000 10.000000 1.000000 75.000000 10.000000 1.000000 0.0
In [44]:
clustered_data[clustered_data["Cluster"]==1].describe()
Out[44]:
LAW_CAT_CD SUSP_AGE_GROUP SUSP_RACE SUSP_SEX VIC_AGE_GROUP VIC_RACE VIC_SEX Cluster
count 53143.000000 53143.000000 53143.000000 53143.000000 53143.000000 53143.000000 53143.000000 53143.0
mean 0.042226 31.261841 5.546281 -0.540711 30.525262 4.286886 0.333797 1.0
std 0.734598 7.135607 5.173945 0.841216 7.244921 5.491311 0.942594 0.0
min -1.000000 14.000000 -10.000000 -1.000000 14.000000 -10.000000 -1.000000 1.0
25% 0.000000 22.000000 2.000000 -1.000000 22.000000 2.000000 -1.000000 1.0
50% 0.000000 35.000000 10.000000 -1.000000 35.000000 2.000000 1.000000 1.0
75% 1.000000 35.000000 10.000000 -1.000000 35.000000 10.000000 1.000000 1.0
max 1.000000 55.000000 10.000000 1.000000 35.000000 10.000000 1.000000 1.0
In [45]:
clustered_data[clustered_data["Cluster"]==2].describe()
Out[45]:
LAW_CAT_CD SUSP_AGE_GROUP SUSP_RACE SUSP_SEX VIC_AGE_GROUP VIC_RACE VIC_SEX Cluster
count 17160.000000 17160.000000 17160.000000 17160.000000 17160.000000 17160.000000 17160.000000 17160.0
mean -0.180594 57.198135 4.347319 -0.539510 44.129138 3.682284 0.279138 2.0
std 0.762232 6.255652 5.747337 0.842003 12.281737 5.808590 0.960218 0.0
min -1.000000 55.000000 -10.000000 -1.000000 14.000000 -10.000000 -1.000000 2.0
25% -1.000000 55.000000 -2.000000 -1.000000 35.000000 -2.000000 -1.000000 2.0
50% 0.000000 55.000000 6.000000 -1.000000 35.000000 2.000000 1.000000 2.0
75% 0.000000 55.000000 10.000000 -1.000000 55.000000 10.000000 1.000000 2.0
max 1.000000 75.000000 10.000000 1.000000 75.000000 10.000000 1.000000 2.0

Trying to predict race of suspects

Now we will try to predict the race of suspects that have not been identified.

In [46]:
#Add BORO_NM column to data_clustering, since it is a variable that we will use to train the model.
#We take an explicit copy first so that the assignment does not trigger pandas' SettingWithCopyWarning.
data_clustering = data_clustering.copy()
data_clustering["BORO_NM"] = data.loc[:, "BORO_NM"]
In [47]:
for value in data_clustering["SUSP_RACE"].unique():
    print(value, ": ",  len(data_clustering[data_clustering["SUSP_RACE"]==value]))
UNKNOWN :  139618
BLACK :  84353
BLACK HISPANIC :  11860
WHITE :  21258
WHITE HISPANIC :  37050
ASIAN / PACIFIC ISLANDER :  7965
AMERICAN INDIAN/ALASKAN NATIVE :  470
In [48]:
#We encode the variables again since now there is a new attribute: the borough
encoder_with_boro= ce.OrdinalEncoder(cols=["BORO_NM", "LAW_CAT_CD", "SUSP_AGE_GROUP", "SUSP_RACE", "SUSP_SEX", 'VIC_SEX', 'VIC_AGE_GROUP', 'VIC_RACE'],return_df=True,
                           mapping=[{'col': "BORO_NM", 'mapping':{"BRONX": 2, "BROOKLYN": 1, "MANHATTAN": 0, "QUEENS": -2}},#STATEN ISLAND IS -1
                                    {'col': "LAW_CAT_CD", 'mapping':{"FELONY": 1, "MISDEMEANOR": 0}},#VIOLATION IS -1
                                    {'col':'SUSP_SEX', 'mapping':{'M':-1,'F':1, 'E': 0, 'D': 0}}, 
                                    {'col': "SUSP_AGE_GROUP", 'mapping':{'<18': 14, '18-24': 22, '25-44': 35, '45-64': 55, '65+': 75}},
                                    {'col': "SUSP_RACE", 'mapping': {"AMERICAN INDIAN/ALASKAN NATIVE": -10, "ASIAN / PACIFIC ISLANDER": -6, "WHITE": -2,
                                                                    "WHITE HISPANIC": 2, "BLACK HISPANIC": 6, "BLACK": 10}},
                                    {'col':'VIC_SEX', 'mapping':{'M':-1,'F':1, 'E': 0, 'D': 0}}, 
                                    {'col': "VIC_AGE_GROUP", 'mapping':{'<18': 14, '18-24': 22, '25-44': 35, '45-64': 55, '65+': 75}},
                                    {'col': "VIC_RACE", 'mapping': {"AMERICAN INDIAN/ALASKAN NATIVE": -10, "ASIAN / PACIFIC ISLANDER": -6, "WHITE": -2,
                                                                    "WHITE HISPANIC": 2, "BLACK HISPANIC": 6, "BLACK": 10}},                                   
                                     ])
In [49]:
#We divide the dataframe in two: one with the unknown values of suspect races and the other one with the known ones
predict_susp_race=data_clustering[(data_clustering["SUSP_RACE"]=="UNKNOWN")]
predict_susp_race = encoder_with_boro.fit_transform(predict_susp_race)
train_susp_race = data_clustering.drop(index=predict_susp_race.index)
train_susp_race = encoder_with_boro.fit_transform(train_susp_race)
In [50]:
print("We have", len(train_susp_race), "suspects with their race identified, and", len(predict_susp_race), "whose race we don't know.")
print("We wanted to predict the race for the unknown ones; however, the amount of data we know is almost the same as the amount we do not know.")
print("Therefore, it will be very difficult to build a model that predicts the race accurately, also taking into account that all variables have been encoded.")
print("However, we will try these predictions anyway: knowing the race of a suspect with around 60% probability is more useful to the police than having no idea at all, even though it is not very precise.")
We have 162956 suspects with their race identified, and 139618 whose race we don't know.
We wanted to predict the race for the unknown ones; however, the amount of data we know is almost the same as the amount we do not know.
Therefore, it will be very difficult to build a model that predicts the race accurately, also taking into account that all variables have been encoded.
However, we will try these predictions anyway: knowing the race of a suspect with around 60% probability is more useful to the police than having no idea at all, even though it is not very precise.
In [51]:
#y: target variable, the race of the suspect.
#X: the rest of variables
y=train_susp_race["SUSP_RACE"]
X=train_susp_race.drop(["SUSP_RACE"], axis = 1)

y_topredict=predict_susp_race["SUSP_RACE"]
X_topredict=predict_susp_race.drop(["SUSP_RACE"], axis = 1)
In [52]:
#We split the data into test and train set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)

To do the predictions, we will try a few different classifiers and see which one performs best. Then, we will use the one with the best accuracy to predict the races of the suspects.

In [53]:
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(random_state = 1)# Fit and evaluate models
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_topredict)
In [54]:
gb.score(X_test,y_test)
Out[54]:
0.6215431189657994
In [55]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state = 1)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_topredict)
rf.score(X_test,y_test)
Out[55]:
0.6120929471444936
In [56]:
from sklearn.ensemble import AdaBoostClassifier
ab = AdaBoostClassifier(random_state = 1)
ab.fit(X_train, y_train)
y_pred_ab = ab.predict(X_topredict)
ab.score(X_test,y_test)
Out[56]:
0.6132384225167731
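The fit-score-compare pattern used in these cells can be sketched generically: fit several models, evaluate each on the held-out set, and keep the best. A self-contained sketch on synthetic data (the model choices here are illustrative, not the notebook's exact ones):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data standing in for the encoded suspect profiles.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_classes=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=1)

models = {
    "tree": DecisionTreeClassifier(random_state=1),
    "rf": RandomForestClassifier(random_state=1),
    "ab": AdaBoostClassifier(random_state=1),
}
# Fit each model and record its held-out accuracy.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
best_name = max(scores, key=scores.get)
best_model = models[best_name]
```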
In [57]:
#Since the best accuracy has been given by GradientBoostingClassifier, we will use those predictions 
predict_susp_race["SUSP_RACE"]=y_pred_gb
In [58]:
#Here we have the encoded dataset with the predicted races of the suspects, which should be mapped back 
#to their original categories before being used by the police. However, we leave that step to them.
predict_susp_race
Out[58]:
LAW_CAT_CD SUSP_AGE_GROUP SUSP_RACE SUSP_SEX VIC_AGE_GROUP VIC_RACE VIC_SEX BORO_NM
12 -1.0 -1.0 2 -1.0 55.0 2.0 -1 2.0
18 1.0 -1.0 10 -1.0 -1.0 -1.0 0 2.0
32 1.0 -1.0 10 -1.0 -1.0 -1.0 0 2.0
34 1.0 -1.0 10 -1.0 -1.0 -1.0 0 0.0
36 1.0 35.0 10 -1.0 35.0 10.0 1 0.0
... ... ... ... ... ... ... ... ...
306643 1.0 -1.0 10 -1.0 22.0 -6.0 -1 1.0
306647 -1.0 -1.0 10 -1.0 35.0 10.0 1 -2.0
306648 1.0 -1.0 10 -1.0 35.0 2.0 1 1.0
306651 0.0 -1.0 10 -1.0 35.0 -1.0 -1 1.0
306653 0.0 -1.0 2 -1.0 22.0 2.0 1 -2.0

139618 rows × 8 columns

Spatial Analysis

For a better understanding of our data, and taking advantage of the spatial variables we have, part of our project consists of the spatial analysis of the characteristics of the complaints in the city of New York.

First of all, we install and import some requirements from the .py files in the folder, which will help us work with geodata.

In [59]:
from requirements import *
from ppca import PPCA
from utils import *
/home/aran/anaconda3/lib/python3.8/site-packages/geopandas/_compat.py:84: UserWarning: The Shapely GEOS version (3.8.0-CAPI-1.13.1 ) is incompatible with the GEOS version PyGEOS was compiled with (3.8.1-CAPI-1.13.3). Conversions between both will be slow.
  warnings.warn(

For this section, we will get back the cleaned data used above. For a quick recall:

In [60]:
print(data.shape)
data.head(3)
(302574, 23)
Out[60]:
CMPLNT_NUM ADDR_PCT_CD BORO_NM CMPLNT_FR_DT CMPLNT_FR_TM CRM_ATPT_CPTD_CD JURIS_DESC LAW_CAT_CD OFNS_DESC PARKS_NM PATROL_BORO PD_DESC PREM_TYP_DESC STATION_NAME SUSP_AGE_GROUP SUSP_RACE SUSP_SEX TRANSIT_DISTRICT VIC_AGE_GROUP VIC_RACE VIC_SEX Latitude Longitude
12 843609180 43 BRONX 2020-08-24 2021-12-31 20:57:00 COMPLETED N.Y. POLICE DEPT VIOLATION HARRASSMENT 2 NaN PATROL BORO BRONX HARASSMENT,SUBD 3,4,5 STREET NaN UNKNOWN UNKNOWN UNKNOWN NaN 45-64 WHITE HISPANIC M 40.832826 -73.867539
18 774311943 52 BRONX 2020-08-12 2021-12-31 00:05:00 COMPLETED N.Y. POLICE DEPT FELONY DANGEROUS WEAPONS NaN PATROL BORO BRONX WEAPONS POSSESSION 3 STREET NaN UNKNOWN UNKNOWN UNKNOWN NaN UNKNOWN UNKNOWN E 40.867046 -73.896316
32 215695380 44 BRONX 2020-06-06 2021-12-31 00:10:00 COMPLETED N.Y. POLICE DEPT FELONY DANGEROUS WEAPONS NaN PATROL BORO BRONX WEAPONS POSSESSION 3 STREET NaN UNKNOWN UNKNOWN UNKNOWN NaN UNKNOWN UNKNOWN E 40.825394 -73.926183

In this next step, we treat the Longitude and Latitude variables as Geo variables, in order to get the exact location of each of the complaints recorded on our dataset.

In [61]:
from geopandas import points_from_xy

#take the Longitude and Latitude columns to transform the pandas DataFrame into a GeoDataFrame

gdf = GeoDataFrame(data, geometry=points_from_xy(data['Longitude'], data['Latitude']))
In [104]:
#The GeoDataframe can be spatially plotted

Map(Layer(gdf, encode_data=False))
Out[104]:

Temporal Spatial Analysis

As previously anticipated, we will now take only those crimes committed on the first day of the year and see how they evolve over the hours using an animated map. There is a timeline where the hours can be selected, or it can simply be played and left to run. We classify the crimes into 3 colors depending on their jurisdiction (police, housing or transit). We selected this day because we think it is interesting to see the distribution of crimes on New Year's Day in a more visual and animated way. You will see that on this specific day, instead of being committed during daytime (we saw before that crimes in general are more frequent by day than by night), the majority of the cases occur during the early morning hours, probably because it is a conflictive night in which people party a lot, which generates several complaints. We also see that most offenses correspond to the Police department and very few to the transit one, which also makes sense.

In [63]:
data_time=data.copy()
data_time["hours"]=data_time["CMPLNT_FR_TM"].dt.hour
data_time.sort_values(by = ['hours'], inplace=True)
data_oneday=data_time[(data_time['CMPLNT_FR_DT'] == '2020-01-01')]
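The `.dt` accessor does both the hour extraction and the date filtering; a self-contained sketch on toy timestamps (column names as in the notebook):

```python
import pandas as pd

times = pd.DataFrame({
    "CMPLNT_FR_DT": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-03-15"]),
    "CMPLNT_FR_TM": pd.to_datetime(["1900-01-01 00:30", "1900-01-01 13:10",
                                    "1900-01-01 02:00"]),
})
# Pull the hour out of the time column, then keep only New Year's Day.
times["hours"] = times["CMPLNT_FR_TM"].dt.hour
one_day = times[times["CMPLNT_FR_DT"] == "2020-01-01"]
```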
In [103]:
data_oneday["size"] = 1
px.set_mapbox_access_token("pk.eyJ1IjoibnVyaWF2IiwiYSI6ImNraWx6ZGRmNjBvYmYyeXBrazFsMjlzOHoifQ.2snfdHzFakwXOjjL51pMtg")
fig = px.scatter_mapbox(data_oneday, lat="Latitude", lon="Longitude",
                        color="JURIS_DESC", size="size", animation_frame="hours",
                  color_continuous_scale=px.colors.cyclical.IceFire, size_max=15, zoom=9, center={'lat': 40.73692, 'lon': -73.9519})
fig.show()
data_oneday = data_oneday.drop('size', axis = 1)
Out[103]:

In a similar way, we have decided to create another animated map, but this time the timeline will be in days instead of hours. For faster rendering, we have decided to filter the data and take only the felony-related complaints. We can see that the number of registered complaints increases heavily in 2020, making that year by far more informative than the previous one.

In [102]:
#Animated map of felony complaints per date

Map(
    Layer(
        gdf[gdf.LAW_CAT_CD == 'FELONY'],
        style=animation_style('CMPLNT_FR_DT', color='blue', size=5, duration=40, fade_in=0.5, fade_out=0.5),
        widgets=animation_widget(title='Date'), encode_data=False
    )
)
Out[102]:

Neighborhood Tabulation Areas Analysis

To get more information from the location of each of the complaints, we will import a geojson with the boundaries of the different Neighborhood Tabulation Areas (NTA) within the city of New York. Following this step, we will be able to study the data with spatial constraints, i.e. filter the complaints by NTA or Borough (District).

In [66]:
data_path = './data/NYAreas.geojson'

#geojson with the boundaries of the Neighborhood Tabulation Areas of the city of NY
ny_nta = gpd.read_file(data_path)
ny_nta.head()
Out[66]:
ntacode shape_area county_fips ntaname shape_leng boro_name boro_code geometry
0 BK88 54005019.048 047 Borough Park 39247.2278309 Brooklyn 3 MULTIPOLYGON (((-73.97605 40.63128, -73.97717 ...
1 QN51 52488277.5915 081 Murray Hill 33266.9048721 Queens 4 MULTIPOLYGON (((-73.80379 40.77561, -73.80099 ...
2 QN27 19726845.5925 081 East Elmhurst 19816.7121316 Queens 4 MULTIPOLYGON (((-73.86110 40.76366, -73.85993 ...
3 BK95 14522603.5915 047 Erasmus 18756.7018646 Brooklyn 3 MULTIPOLYGON (((-73.94826 40.63861, -73.94923 ...
4 QN07 22887772.8087 081 Hollis 20976.3355996 Queens 4 MULTIPOLYGON (((-73.75726 40.71814, -73.75589 ...

We check that the datatypes of all the columns are correct; county_fips and boro_code are stored as strings, so we convert them to int types.

In [67]:
ny_nta['county_fips'] = ny_nta['county_fips'].astype('int64')
ny_nta['boro_code'] = ny_nta['boro_code'].astype('int64')
ny_nta.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   ntacode      195 non-null    object  
 1   shape_area   195 non-null    object  
 2   county_fips  195 non-null    int64   
 3   ntaname      195 non-null    object  
 4   shape_leng   195 non-null    object  
 5   boro_name    195 non-null    object  
 6   boro_code    195 non-null    int64   
 7   geometry     195 non-null    geometry
dtypes: geometry(1), int64(2), object(5)
memory usage: 12.3+ KB

Just as before, we can plot the geojson file on a map and see the boundaries of each of the NTAs of New York.

In [105]:
plot_size = (920, 400)

basicstyle = cartoframes.viz.basic_style(color = '#eaff00', stroke_color = '#eaff00', stroke_width = 2, opacity = 0.2)

Map(Layer(ny_nta, basicstyle, encode_data=False),  size = plot_size,show_info=True)
Out[105]:

Once both the complaints dataset and the geojson with the different NTAs are prepared, we start with the key step of the spatial analysis: joining both datasets by the location of the complaints, thus adding to each complaint the NTA in which it is situated. The geopandas function sjoin provides exactly this.
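Conceptually, the spatial join matches each point to the polygon that contains it. A pure-Python ray-casting sketch of that point-in-polygon test (illustrative only; geopandas uses spatial indexes and shapely predicates, not this code):

```python
def point_in_polygon(x, y, poly):
    """Ray-casting test: is (x, y) inside the polygon given as [(x, y), ...]?"""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Count crossings of a horizontal ray going right from (x, y).
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
```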

In [69]:
complete_gpd = gpd.sjoin(left_df=gdf, right_df=ny_nta, how='inner')

complete_gpd.head(5)
Out[69]:
CMPLNT_NUM ADDR_PCT_CD BORO_NM CMPLNT_FR_DT CMPLNT_FR_TM CRM_ATPT_CPTD_CD JURIS_DESC LAW_CAT_CD OFNS_DESC PARKS_NM PATROL_BORO PD_DESC PREM_TYP_DESC STATION_NAME SUSP_AGE_GROUP SUSP_RACE SUSP_SEX TRANSIT_DISTRICT VIC_AGE_GROUP VIC_RACE VIC_SEX Latitude Longitude geometry index_right ntacode shape_area county_fips ntaname shape_leng boro_name boro_code
12 843609180 43 BRONX 2020-08-24 2021-12-31 20:57:00 COMPLETED N.Y. POLICE DEPT VIOLATION HARRASSMENT 2 NaN PATROL BORO BRONX HARASSMENT,SUBD 3,4,5 STREET NaN UNKNOWN UNKNOWN UNKNOWN NaN 45-64 WHITE HISPANIC M 40.832826 -73.867539 POINT (-73.86754 40.83283) 33 BX08 15084986.2976 5 West Farms-Bronx River 17477.2947609 Bronx 2
342 137408448 43 BRONX 2020-09-25 2021-12-31 22:50:00 COMPLETED N.Y. POLICE DEPT MISDEMEANOR ASSAULT 3 & RELATED OFFENSES NaN PATROL BORO BRONX ASSAULT 3 RESIDENCE-HOUSE NaN 25-44 WHITE HISPANIC M NaN 18-24 BLACK F 40.833290 -73.872489 POINT (-73.87249 40.83329) 33 BX08 15084986.2976 5 West Farms-Bronx River 17477.2947609 Bronx 2
351 884889567 43 BRONX 2020-09-29 2021-12-31 23:30:00 COMPLETED N.Y. POLICE DEPT VIOLATION HARRASSMENT 2 NaN PATROL BORO BRONX HARASSMENT,SUBD 3,4,5 RESIDENCE - APT. HOUSE NaN 25-44 WHITE HISPANIC M NaN 45-64 WHITE HISPANIC F 40.832034 -73.868307 POINT (-73.86831 40.83203) 33 BX08 15084986.2976 5 West Farms-Bronx River 17477.2947609 Bronx 2
1039 136918974 43 BRONX 2020-09-26 2021-12-31 17:30:00 COMPLETED N.Y. POLICE DEPT FELONY GRAND LARCENY NaN PATROL BORO BRONX LARCENY,GRAND BY THEFT OF CREDIT CARD STREET NaN UNKNOWN UNKNOWN U NaN 45-64 BLACK M 40.838779 -73.864701 POINT (-73.86470 40.83878) 33 BX08 15084986.2976 5 West Farms-Bronx River 17477.2947609 Bronx 2
1108 959848190 43 BRONX 2020-09-22 2021-12-31 00:40:00 COMPLETED N.Y. HOUSING POLICE VIOLATION HARRASSMENT 2 NaN PATROL BORO BRONX HARASSMENT,SUBD 1,CIVILIAN RESIDENCE - PUBLIC HOUSING NaN 18-24 BLACK F NaN 25-44 WHITE HISPANIC F 40.834955 -73.878673 POINT (-73.87867 40.83495) 33 BX08 15084986.2976 5 West Farms-Bronx River 17477.2947609 Bronx 2
In [70]:
complete_gpd.info(memory_usage=False)
<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 302556 entries, 12 to 261804
Data columns (total 32 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   CMPLNT_NUM        302556 non-null  int64         
 1   ADDR_PCT_CD       302556 non-null  int64         
 2   BORO_NM           302556 non-null  category      
 3   CMPLNT_FR_DT      302556 non-null  datetime64[ns]
 4   CMPLNT_FR_TM      302556 non-null  datetime64[ns]
 5   CRM_ATPT_CPTD_CD  302556 non-null  category      
 6   JURIS_DESC        302556 non-null  category      
 7   LAW_CAT_CD        302556 non-null  category      
 8   OFNS_DESC         302556 non-null  category      
 9   PARKS_NM          2073 non-null    category      
 10  PATROL_BORO       302556 non-null  category      
 11  PD_DESC           302556 non-null  category      
 12  PREM_TYP_DESC     302556 non-null  category      
 13  STATION_NAME      5350 non-null    category      
 14  SUSP_AGE_GROUP    302556 non-null  category      
 15  SUSP_RACE         302556 non-null  category      
 16  SUSP_SEX          302556 non-null  category      
 17  TRANSIT_DISTRICT  5350 non-null    float64       
 18  VIC_AGE_GROUP     302556 non-null  category      
 19  VIC_RACE          302556 non-null  category      
 20  VIC_SEX           302556 non-null  category      
 21  Latitude          302556 non-null  float64       
 22  Longitude         302556 non-null  float64       
 23  geometry          302556 non-null  geometry      
 24  index_right       302556 non-null  int64         
 25  ntacode           302556 non-null  object        
 26  shape_area        302556 non-null  object        
 27  county_fips       302556 non-null  int64         
 28  ntaname           302556 non-null  object        
 29  shape_leng        302556 non-null  object        
 30  boro_name         302556 non-null  object        
 31  boro_code         302556 non-null  int64         
dtypes: category(16), datetime64[ns](2), float64(3), geometry(1), int64(5), object(5)

We can now filter the complaints by Boroughs and also by smaller areas, the NTAs. As an example, here we filter those complaints that are located only in West Farms-Bronx River, Bronx.

In [106]:
Map(Layer(complete_gpd[complete_gpd.ntaname == 'West Farms-Bronx River']))
Out[106]:

As we have done previously in the project, we will use the external platform Tableau to visually represent our spatial data and thus be able to better approach our project. The dashboard can be seen by clicking the following link: https://public.tableau.com/profile/aran3436#!/vizhome/NewYorkComplaints/Dashboard1?publish=yes

In [72]:
complete_gpd.to_csv('Compaints_GeoDF.csv', sep=';')

After doing the visualizations, we thought it would be nice to add a map with the parks and another one with the stations to give more support to our analysis and the conclusions drawn from the dashboards. The data is filtered to those complaints recorded in public parks or stations, painted according to the district. We can see that, as stated in the report and by visually inspecting the Tableau dashboard, Manhattan is clearly the district with the most complaints in its public places.

In [107]:
#Map with complaints recorded on Stations

viewport = {'zoom': 9.8, 'lat':40.739261, 'lng': -73.862240}


color_bins_st = color_bins_style('boro_name', bins=len(complete_gpd['boro_name'].unique()), breaks=complete_gpd['boro_name'].unique())
color_bins_leg = color_bins_legend(title='Public Stations', description='Different Districts', footer='Color of the blob')

Map([Layer(complete_gpd[(complete_gpd['STATION_NAME'].notnull())], color_bins_st, color_bins_leg
           )], show_info=True, viewport=viewport)
Out[107]:
In [108]:
#Map with complaints recorded on Parks

viewport = {'zoom': 9.8, 'lat':40.739261, 'lng': -73.862240}

color_bins_st = color_bins_style('boro_name', bins=len(complete_gpd['boro_name'].unique()), breaks=complete_gpd['boro_name'].unique())
color_bins_leg = color_bins_legend(title='Public Parks', description='Different Districts', footer='Color of the blob')

Map([Layer(complete_gpd[(complete_gpd['PARKS_NM'].notnull())], color_bins_st, color_bins_leg
           )], show_info=True, viewport=viewport)
Out[108]:

Now we will change the perspective of the project and the way we look at our data a little. Instead of having a dataset with information about each of the complaints, we will now gather the different instances and group them by their location; in other words, we will study the characteristics of each of the NTAs with regard to the complaints located within them.
This way we will be able to study and analyse the situation in each of the different NTAs of New York.

To do that, we have to encode the categorical variables (those that are not binary) with one-hot encoding. The get_dummies() function lets us do this correctly.

In [75]:
#we will only take the categorical columns we are interested in. 
interesting_cols = ['LAW_CAT_CD', 'OFNS_DESC']
suspect_cols = ['SUSP_AGE_GROUP', 'SUSP_RACE', 'SUSP_SEX']
vic_cols = ['VIC_AGE_GROUP', 'VIC_RACE', 'VIC_SEX']
needed_cols = ['ntaname', 'CMPLNT_NUM']

nta_gdf = complete_gpd[interesting_cols+suspect_cols+vic_cols+needed_cols]
print(nta_gdf.shape)
nta_gdf.info()
(302556, 10)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 302556 entries, 12 to 261804
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype   
---  ------          --------------   -----   
 0   LAW_CAT_CD      302556 non-null  category
 1   OFNS_DESC       302556 non-null  category
 2   SUSP_AGE_GROUP  302556 non-null  category
 3   SUSP_RACE       302556 non-null  category
 4   SUSP_SEX        302556 non-null  category
 5   VIC_AGE_GROUP   302556 non-null  category
 6   VIC_RACE        302556 non-null  category
 7   VIC_SEX         302556 non-null  category
 8   ntaname         302556 non-null  object  
 9   CMPLNT_NUM      302556 non-null  int64   
dtypes: category(8), int64(1), object(1)
memory usage: 9.2+ MB

The total number of resulting columns will be the sum of the numbers of unique values of the current variables.

In [76]:
for var in interesting_cols+suspect_cols+vic_cols: 
    print(f'Variable {var}: {len(complete_gpd[var].unique())} \n')
Variable LAW_CAT_CD: 3 

Variable OFNS_DESC: 57 

Variable SUSP_AGE_GROUP: 6 

Variable SUSP_RACE: 7 

Variable SUSP_SEX: 4 

Variable VIC_AGE_GROUP: 6 

Variable VIC_RACE: 7 

Variable VIC_SEX: 4 

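As a quick sanity check, these unique-value counts predict the width of the encoded dataframe: each categorical column becomes one indicator column per unique value, while ntaname and CMPLNT_NUM are carried through unchanged. A small arithmetic sketch using the counts printed above:

```python
# Unique-value counts printed above, one entry per categorical column
unique_counts = {
    'LAW_CAT_CD': 3, 'OFNS_DESC': 57,
    'SUSP_AGE_GROUP': 6, 'SUSP_RACE': 7, 'SUSP_SEX': 4,
    'VIC_AGE_GROUP': 6, 'VIC_RACE': 7, 'VIC_SEX': 4,
}

# Each category expands into one 0/1 indicator column
n_dummy_cols = sum(unique_counts.values())   # 94

# 'ntaname' and 'CMPLNT_NUM' are kept unchanged
n_total_cols = n_dummy_cols + 2              # 96, matching the output below
```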
As explained, we will group the data by NTA name. Encoding splits each categorical column into one indicator column per unique value; we need these numerical indicators so that they can be summed up during the aggregation. This is exactly what the get_dummies function gives us.

In [77]:
# get_dummies accepts a list of columns, so a single call encodes them all
nta_gdf = pd.get_dummies(nta_gdf, columns=interesting_cols+suspect_cols+vic_cols)

nta_gdf
Out[77]:
ntaname CMPLNT_NUM LAW_CAT_CD_FELONY LAW_CAT_CD_MISDEMEANOR LAW_CAT_CD_VIOLATION OFNS_DESC_ADMINISTRATIVE CODE OFNS_DESC_AGRICULTURE & MRKTS LAW-UNCLASSIFIED OFNS_DESC_ALCOHOLIC BEVERAGE CONTROL LAW OFNS_DESC_ANTICIPATORY OFFENSES OFNS_DESC_ARSON OFNS_DESC_ASSAULT 3 & RELATED OFFENSES OFNS_DESC_BURGLAR'S TOOLS OFNS_DESC_BURGLARY OFNS_DESC_CHILD ABANDONMENT/NON SUPPORT OFNS_DESC_CRIMINAL MISCHIEF & RELATED OF OFNS_DESC_CRIMINAL TRESPASS OFNS_DESC_DANGEROUS DRUGS OFNS_DESC_DANGEROUS WEAPONS OFNS_DESC_DISORDERLY CONDUCT OFNS_DESC_ENDAN WELFARE INCOMP OFNS_DESC_ESCAPE 3 OFNS_DESC_FELONY ASSAULT OFNS_DESC_FELONY SEX CRIMES OFNS_DESC_FORGERY OFNS_DESC_FRAUDS OFNS_DESC_FRAUDULENT ACCOSTING OFNS_DESC_GAMBLING OFNS_DESC_GRAND LARCENY OFNS_DESC_GRAND LARCENY OF MOTOR VEHICLE OFNS_DESC_HARRASSMENT 2 OFNS_DESC_HOMICIDE-NEGLIGENT,UNCLASSIFIE OFNS_DESC_HOMICIDE-NEGLIGENT-VEHICLE OFNS_DESC_INTOXICATED & IMPAIRED DRIVING OFNS_DESC_JOSTLING OFNS_DESC_KIDNAPPING OFNS_DESC_KIDNAPPING & RELATED OFFENSES OFNS_DESC_LOITERING/GAMBLING (CARDS, DIC OFNS_DESC_MISCELLANEOUS PENAL LAW OFNS_DESC_NEW YORK CITY HEALTH CODE OFNS_DESC_NYS LAWS-UNCLASSIFIED FELONY OFNS_DESC_NYS LAWS-UNCLASSIFIED VIOLATION OFNS_DESC_OFF. AGNST PUB ORD SENSBLTY & OFNS_DESC_OFFENSES AGAINST PUBLIC ADMINI OFNS_DESC_OFFENSES AGAINST PUBLIC SAFETY OFNS_DESC_OFFENSES AGAINST THE PERSON OFNS_DESC_OFFENSES INVOLVING FRAUD OFNS_DESC_OFFENSES RELATED TO CHILDREN OFNS_DESC_OTHER OFFENSES RELATED TO THEF OFNS_DESC_OTHER STATE LAWS OFNS_DESC_OTHER STATE LAWS (NON PENAL LA OFNS_DESC_PETIT LARCENY OFNS_DESC_PETIT LARCENY OF MOTOR VEHICLE OFNS_DESC_POSSESSION OF STOLEN PROPERTY OFNS_DESC_PROSTITUTION & RELATED OFFENSES OFNS_DESC_RAPE OFNS_DESC_ROBBERY OFNS_DESC_SEX CRIMES OFNS_DESC_THEFT OF SERVICES OFNS_DESC_THEFT-FRAUD OFNS_DESC_UNAUTHORIZED USE OF A VEHICLE OFNS_DESC_UNLAWFUL POSS. WEAP. 
ON SCHOOL OFNS_DESC_VEHICLE AND TRAFFIC LAWS SUSP_AGE_GROUP_18-24 SUSP_AGE_GROUP_25-44 SUSP_AGE_GROUP_45-64 SUSP_AGE_GROUP_65+ SUSP_AGE_GROUP_<18 SUSP_AGE_GROUP_UNKNOWN SUSP_RACE_AMERICAN INDIAN/ALASKAN NATIVE SUSP_RACE_ASIAN / PACIFIC ISLANDER SUSP_RACE_BLACK SUSP_RACE_BLACK HISPANIC SUSP_RACE_UNKNOWN SUSP_RACE_WHITE SUSP_RACE_WHITE HISPANIC SUSP_SEX_F SUSP_SEX_M SUSP_SEX_U SUSP_SEX_UNKNOWN VIC_AGE_GROUP_18-24 VIC_AGE_GROUP_25-44 VIC_AGE_GROUP_45-64 VIC_AGE_GROUP_65+ VIC_AGE_GROUP_<18 VIC_AGE_GROUP_UNKNOWN VIC_RACE_AMERICAN INDIAN/ALASKAN NATIVE VIC_RACE_ASIAN / PACIFIC ISLANDER VIC_RACE_BLACK VIC_RACE_BLACK HISPANIC VIC_RACE_UNKNOWN VIC_RACE_WHITE VIC_RACE_WHITE HISPANIC VIC_SEX_D VIC_SEX_E VIC_SEX_F VIC_SEX_M
12 West Farms-Bronx River 843609180 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1
342 West Farms-Bronx River 137408448 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0
351 West Farms-Bronx River 884889567 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0
1039 West Farms-Bronx River 136918974 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1
1108 West Farms-Bronx River 959848190 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
188232 Airport 705910592 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
193922 Airport 250275227 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1
228287 Airport 238008981 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
248978 Airport 764957950 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
261804 Airport 412215100 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0

302556 rows × 96 columns

Since the next step is grouping the instances by ntaname and summing all the columns, we replace the complaint IDs with the value 1: we are only interested in the total count per NTA, not the ID numbers themselves, and summing a column of ones yields exactly that count.

In [78]:
nta_gdf['CMPLNT_NUM'] = 1

We now group the instances by ntaname and sum up the values of their columns.

In [79]:
#After grouping the instances by their ntaname, we obtain a dataframe whose 
#rows are the NTAs of New York

nta_gdf = nta_gdf.groupby(['ntaname']).sum()
nta_gdf.head(3)
Out[79]:
CMPLNT_NUM LAW_CAT_CD_FELONY LAW_CAT_CD_MISDEMEANOR LAW_CAT_CD_VIOLATION OFNS_DESC_ADMINISTRATIVE CODE OFNS_DESC_AGRICULTURE & MRKTS LAW-UNCLASSIFIED OFNS_DESC_ALCOHOLIC BEVERAGE CONTROL LAW OFNS_DESC_ANTICIPATORY OFFENSES OFNS_DESC_ARSON OFNS_DESC_ASSAULT 3 & RELATED OFFENSES OFNS_DESC_BURGLAR'S TOOLS OFNS_DESC_BURGLARY OFNS_DESC_CHILD ABANDONMENT/NON SUPPORT OFNS_DESC_CRIMINAL MISCHIEF & RELATED OF OFNS_DESC_CRIMINAL TRESPASS OFNS_DESC_DANGEROUS DRUGS OFNS_DESC_DANGEROUS WEAPONS OFNS_DESC_DISORDERLY CONDUCT OFNS_DESC_ENDAN WELFARE INCOMP OFNS_DESC_ESCAPE 3 OFNS_DESC_FELONY ASSAULT OFNS_DESC_FELONY SEX CRIMES OFNS_DESC_FORGERY OFNS_DESC_FRAUDS OFNS_DESC_FRAUDULENT ACCOSTING OFNS_DESC_GAMBLING OFNS_DESC_GRAND LARCENY OFNS_DESC_GRAND LARCENY OF MOTOR VEHICLE OFNS_DESC_HARRASSMENT 2 OFNS_DESC_HOMICIDE-NEGLIGENT,UNCLASSIFIE OFNS_DESC_HOMICIDE-NEGLIGENT-VEHICLE OFNS_DESC_INTOXICATED & IMPAIRED DRIVING OFNS_DESC_JOSTLING OFNS_DESC_KIDNAPPING OFNS_DESC_KIDNAPPING & RELATED OFFENSES OFNS_DESC_LOITERING/GAMBLING (CARDS, DIC OFNS_DESC_MISCELLANEOUS PENAL LAW OFNS_DESC_NEW YORK CITY HEALTH CODE OFNS_DESC_NYS LAWS-UNCLASSIFIED FELONY OFNS_DESC_NYS LAWS-UNCLASSIFIED VIOLATION OFNS_DESC_OFF. AGNST PUB ORD SENSBLTY & OFNS_DESC_OFFENSES AGAINST PUBLIC ADMINI OFNS_DESC_OFFENSES AGAINST PUBLIC SAFETY OFNS_DESC_OFFENSES AGAINST THE PERSON OFNS_DESC_OFFENSES INVOLVING FRAUD OFNS_DESC_OFFENSES RELATED TO CHILDREN OFNS_DESC_OTHER OFFENSES RELATED TO THEF OFNS_DESC_OTHER STATE LAWS OFNS_DESC_OTHER STATE LAWS (NON PENAL LA OFNS_DESC_PETIT LARCENY OFNS_DESC_PETIT LARCENY OF MOTOR VEHICLE OFNS_DESC_POSSESSION OF STOLEN PROPERTY OFNS_DESC_PROSTITUTION & RELATED OFFENSES OFNS_DESC_RAPE OFNS_DESC_ROBBERY OFNS_DESC_SEX CRIMES OFNS_DESC_THEFT OF SERVICES OFNS_DESC_THEFT-FRAUD OFNS_DESC_UNAUTHORIZED USE OF A VEHICLE OFNS_DESC_UNLAWFUL POSS. WEAP. 
ON SCHOOL OFNS_DESC_VEHICLE AND TRAFFIC LAWS SUSP_AGE_GROUP_18-24 SUSP_AGE_GROUP_25-44 SUSP_AGE_GROUP_45-64 SUSP_AGE_GROUP_65+ SUSP_AGE_GROUP_<18 SUSP_AGE_GROUP_UNKNOWN SUSP_RACE_AMERICAN INDIAN/ALASKAN NATIVE SUSP_RACE_ASIAN / PACIFIC ISLANDER SUSP_RACE_BLACK SUSP_RACE_BLACK HISPANIC SUSP_RACE_UNKNOWN SUSP_RACE_WHITE SUSP_RACE_WHITE HISPANIC SUSP_SEX_F SUSP_SEX_M SUSP_SEX_U SUSP_SEX_UNKNOWN VIC_AGE_GROUP_18-24 VIC_AGE_GROUP_25-44 VIC_AGE_GROUP_45-64 VIC_AGE_GROUP_65+ VIC_AGE_GROUP_<18 VIC_AGE_GROUP_UNKNOWN VIC_RACE_AMERICAN INDIAN/ALASKAN NATIVE VIC_RACE_ASIAN / PACIFIC ISLANDER VIC_RACE_BLACK VIC_RACE_BLACK HISPANIC VIC_RACE_UNKNOWN VIC_RACE_WHITE VIC_RACE_WHITE HISPANIC VIC_SEX_D VIC_SEX_E VIC_SEX_F VIC_SEX_M
ntaname
Airport 13 1.0 11.0 1.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.0 1.0 3.0 1.0 0.0 0.0 8.0 0.0 1.0 1.0 0.0 9.0 0.0 2.0 1.0 4.0 4.0 4.0 1.0 5.0 4.0 0.0 0.0 3.0 0.0 2.0 5.0 0.0 3.0 2.0 1.0 1.0 2.0 2.0 8.0
Allerton-Pelham Gardens 742 192.0 396.0 154.0 4.0 0.0 0.0 0.0 1.0 76.0 0.0 23.0 0.0 97.0 5.0 5.0 5.0 0.0 0.0 0.0 32.0 0.0 0.0 0.0 0.0 0.0 57.0 27.0 154.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 26.0 0.0 0.0 0.0 38.0 12.0 0.0 1.0 0.0 0.0 0.0 0.0 2.0 137.0 0.0 1.0 0.0 0.0 13.0 0.0 0.0 1.0 3.0 0.0 19.0 42.0 190.0 75.0 3.0 12.0 420.0 6.0 11.0 186.0 38.0 359.0 39.0 103.0 103.0 314.0 102.0 223.0 71.0 326.0 223.0 44.0 12.0 66.0 9.0 56.0 226.0 52.0 96.0 110.0 193.0 37.0 27.0 363.0 315.0
Annadale-Huguenot-Prince's Bay-Eltingville 412 103.0 220.0 89.0 2.0 0.0 0.0 0.0 0.0 43.0 0.0 10.0 0.0 61.0 9.0 4.0 3.0 0.0 0.0 0.0 16.0 0.0 0.0 4.0 0.0 0.0 25.0 11.0 86.0 0.0 0.0 14.0 0.0 0.0 0.0 0.0 18.0 0.0 0.0 0.0 28.0 20.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 40.0 0.0 3.0 0.0 0.0 3.0 0.0 0.0 6.0 0.0 0.0 3.0 21.0 112.0 44.0 9.0 6.0 220.0 2.0 5.0 19.0 0.0 206.0 152.0 28.0 48.0 170.0 37.0 157.0 28.0 142.0 110.0 34.0 21.0 77.0 1.0 15.0 6.0 3.0 84.0 283.0 20.0 19.0 52.0 185.0 156.0

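On a toy dataframe, the dummy-encode-then-aggregate pattern used above looks like this (illustrative values, not the real data):

```python
import pandas as pd

# Toy complaints table: one row per complaint, with its NTA and one categorical attribute
toy = pd.DataFrame({
    'ntaname':    ['Airport', 'Airport', 'Allerton'],
    'CMPLNT_NUM': [1, 1, 1],  # IDs already replaced by 1 so the sum is a count
    'LAW_CAT_CD': ['FELONY', 'MISDEMEANOR', 'FELONY'],
})

# One-hot encode the categorical column, then sum everything per NTA
toy = pd.get_dummies(toy, columns=['LAW_CAT_CD'])
per_nta = toy.groupby('ntaname').sum()

# Airport: 2 complaints (1 felony, 1 misdemeanor); Allerton: 1 felony
```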
Again, we are interested in the spatial location of our data, so we merge the GeoJSON information with the index of our current dataset, the NTA names.

In [80]:
#As before, we add the spatial variables to the dataframe, so we can map-plot our data.
nta_gdf = pd.merge(nta_gdf, ny_nta, on='ntaname')
nta_gdf.head(3)
Out[80]:
ntaname CMPLNT_NUM LAW_CAT_CD_FELONY LAW_CAT_CD_MISDEMEANOR LAW_CAT_CD_VIOLATION OFNS_DESC_ADMINISTRATIVE CODE OFNS_DESC_AGRICULTURE & MRKTS LAW-UNCLASSIFIED OFNS_DESC_ALCOHOLIC BEVERAGE CONTROL LAW OFNS_DESC_ANTICIPATORY OFFENSES OFNS_DESC_ARSON OFNS_DESC_ASSAULT 3 & RELATED OFFENSES OFNS_DESC_BURGLAR'S TOOLS OFNS_DESC_BURGLARY OFNS_DESC_CHILD ABANDONMENT/NON SUPPORT OFNS_DESC_CRIMINAL MISCHIEF & RELATED OF OFNS_DESC_CRIMINAL TRESPASS OFNS_DESC_DANGEROUS DRUGS OFNS_DESC_DANGEROUS WEAPONS OFNS_DESC_DISORDERLY CONDUCT OFNS_DESC_ENDAN WELFARE INCOMP OFNS_DESC_ESCAPE 3 OFNS_DESC_FELONY ASSAULT OFNS_DESC_FELONY SEX CRIMES OFNS_DESC_FORGERY OFNS_DESC_FRAUDS OFNS_DESC_FRAUDULENT ACCOSTING OFNS_DESC_GAMBLING OFNS_DESC_GRAND LARCENY OFNS_DESC_GRAND LARCENY OF MOTOR VEHICLE OFNS_DESC_HARRASSMENT 2 OFNS_DESC_HOMICIDE-NEGLIGENT,UNCLASSIFIE OFNS_DESC_HOMICIDE-NEGLIGENT-VEHICLE OFNS_DESC_INTOXICATED & IMPAIRED DRIVING OFNS_DESC_JOSTLING OFNS_DESC_KIDNAPPING OFNS_DESC_KIDNAPPING & RELATED OFFENSES OFNS_DESC_LOITERING/GAMBLING (CARDS, DIC OFNS_DESC_MISCELLANEOUS PENAL LAW OFNS_DESC_NEW YORK CITY HEALTH CODE OFNS_DESC_NYS LAWS-UNCLASSIFIED FELONY OFNS_DESC_NYS LAWS-UNCLASSIFIED VIOLATION OFNS_DESC_OFF. AGNST PUB ORD SENSBLTY & OFNS_DESC_OFFENSES AGAINST PUBLIC ADMINI OFNS_DESC_OFFENSES AGAINST PUBLIC SAFETY OFNS_DESC_OFFENSES AGAINST THE PERSON OFNS_DESC_OFFENSES INVOLVING FRAUD OFNS_DESC_OFFENSES RELATED TO CHILDREN OFNS_DESC_OTHER OFFENSES RELATED TO THEF OFNS_DESC_OTHER STATE LAWS OFNS_DESC_OTHER STATE LAWS (NON PENAL LA OFNS_DESC_PETIT LARCENY OFNS_DESC_PETIT LARCENY OF MOTOR VEHICLE OFNS_DESC_POSSESSION OF STOLEN PROPERTY OFNS_DESC_PROSTITUTION & RELATED OFFENSES OFNS_DESC_RAPE OFNS_DESC_ROBBERY OFNS_DESC_SEX CRIMES OFNS_DESC_THEFT OF SERVICES OFNS_DESC_THEFT-FRAUD OFNS_DESC_UNAUTHORIZED USE OF A VEHICLE OFNS_DESC_UNLAWFUL POSS. WEAP. 
ON SCHOOL OFNS_DESC_VEHICLE AND TRAFFIC LAWS SUSP_AGE_GROUP_18-24 SUSP_AGE_GROUP_25-44 SUSP_AGE_GROUP_45-64 SUSP_AGE_GROUP_65+ SUSP_AGE_GROUP_<18 SUSP_AGE_GROUP_UNKNOWN SUSP_RACE_AMERICAN INDIAN/ALASKAN NATIVE SUSP_RACE_ASIAN / PACIFIC ISLANDER SUSP_RACE_BLACK SUSP_RACE_BLACK HISPANIC SUSP_RACE_UNKNOWN SUSP_RACE_WHITE SUSP_RACE_WHITE HISPANIC SUSP_SEX_F SUSP_SEX_M SUSP_SEX_U SUSP_SEX_UNKNOWN VIC_AGE_GROUP_18-24 VIC_AGE_GROUP_25-44 VIC_AGE_GROUP_45-64 VIC_AGE_GROUP_65+ VIC_AGE_GROUP_<18 VIC_AGE_GROUP_UNKNOWN VIC_RACE_AMERICAN INDIAN/ALASKAN NATIVE VIC_RACE_ASIAN / PACIFIC ISLANDER VIC_RACE_BLACK VIC_RACE_BLACK HISPANIC VIC_RACE_UNKNOWN VIC_RACE_WHITE VIC_RACE_WHITE HISPANIC VIC_SEX_D VIC_SEX_E VIC_SEX_F VIC_SEX_M ntacode shape_area county_fips shape_leng boro_name boro_code geometry
0 Airport 13 1.0 11.0 1.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.0 1.0 3.0 1.0 0.0 0.0 8.0 0.0 1.0 1.0 0.0 9.0 0.0 2.0 1.0 4.0 4.0 4.0 1.0 5.0 4.0 0.0 0.0 3.0 0.0 2.0 5.0 0.0 3.0 2.0 1.0 1.0 2.0 2.0 8.0 QN98 228198998.046 81 136232.079658 Queens 4 MULTIPOLYGON (((-73.74712 40.63716, -73.74707 ...
1 Allerton-Pelham Gardens 742 192.0 396.0 154.0 4.0 0.0 0.0 0.0 1.0 76.0 0.0 23.0 0.0 97.0 5.0 5.0 5.0 0.0 0.0 0.0 32.0 0.0 0.0 0.0 0.0 0.0 57.0 27.0 154.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 26.0 0.0 0.0 0.0 38.0 12.0 0.0 1.0 0.0 0.0 0.0 0.0 2.0 137.0 0.0 1.0 0.0 0.0 13.0 0.0 0.0 1.0 3.0 0.0 19.0 42.0 190.0 75.0 3.0 12.0 420.0 6.0 11.0 186.0 38.0 359.0 39.0 103.0 103.0 314.0 102.0 223.0 71.0 326.0 223.0 44.0 12.0 66.0 9.0 56.0 226.0 52.0 96.0 110.0 193.0 37.0 27.0 363.0 315.0 BX31 31693044.2846 5 25467.1174253 Bronx 2 MULTIPOLYGON (((-73.84793 40.87134, -73.84725 ...
2 Annadale-Huguenot-Prince's Bay-Eltingville 412 103.0 220.0 89.0 2.0 0.0 0.0 0.0 0.0 43.0 0.0 10.0 0.0 61.0 9.0 4.0 3.0 0.0 0.0 0.0 16.0 0.0 0.0 4.0 0.0 0.0 25.0 11.0 86.0 0.0 0.0 14.0 0.0 0.0 0.0 0.0 18.0 0.0 0.0 0.0 28.0 20.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 40.0 0.0 3.0 0.0 0.0 3.0 0.0 0.0 6.0 0.0 0.0 3.0 21.0 112.0 44.0 9.0 6.0 220.0 2.0 5.0 19.0 0.0 206.0 152.0 28.0 48.0 170.0 37.0 157.0 28.0 142.0 110.0 34.0 21.0 77.0 1.0 15.0 6.0 3.0 84.0 283.0 20.0 19.0 52.0 185.0 156.0 SI01 141057788.612 85 71174.5830099 Staten Island 5 MULTIPOLYGON (((-74.16983 40.56109, -74.16983 ...

Before creating some maps, we thought it was worth showing some statistics. Staten Island is the borough with the fewest complaints, and the difference is quite noticeable. Queens and the Bronx have a similar overall count of complaints, but the mean per neighborhood is considerably higher in the Bronx. Similarly, although Brooklyn concentrates a large share of the complaints, its many neighborhoods bring its mean down to a value similar to the Bronx's, leaving Manhattan as the borough with the most complaints per neighborhood by a wide margin.

In [81]:
stat_isl = nta_gdf[nta_gdf.boro_name == 'Staten Island']['CMPLNT_NUM']
queens = nta_gdf[nta_gdf.boro_name == 'Queens']['CMPLNT_NUM']
bronx = nta_gdf[nta_gdf.boro_name == 'Bronx']['CMPLNT_NUM']
brooklyn = nta_gdf[nta_gdf.boro_name == 'Brooklyn']['CMPLNT_NUM']
manhattan = nta_gdf[nta_gdf.boro_name == 'Manhattan']['CMPLNT_NUM']

print(f' Staten Island. Total Complaints: {stat_isl.sum()}. Mean: {round(stat_isl.mean(),2)}')
print(f' Queens. Total Complaints: {queens.sum()}. Mean: {round(queens.mean(),2)}')
print(f' Bronx. Total Complaints: {bronx.sum()}. Mean: {round(bronx.mean(),2)}')
print(f' Brooklyn. Total Complaints: {brooklyn.sum()}. Mean: {round(brooklyn.mean(),2)}')
print(f' Manhattan. Total Complaints: {manhattan.sum()}. Mean: {round(manhattan.mean(),2)}')
 Staten Island. Total Complaints: 12495. Mean: 657.63
 Queens. Total Complaints: 64977. Mean: 1120.29
 Bronx. Total Complaints: 66006. Mean: 1737.0
 Brooklyn. Total Complaints: 87754. Mean: 1720.67
 Manhattan. Total Complaints: 71324. Mean: 2459.45
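The five near-identical selections above can also be produced in one shot with a groupby aggregation; on the real data the call would be `nta_gdf.groupby('boro_name')['CMPLNT_NUM'].agg(['sum', 'mean'])`. A sketch on synthetic numbers:

```python
import pandas as pd

# Synthetic per-NTA complaint counts standing in for nta_gdf
toy = pd.DataFrame({
    'boro_name':  ['Staten Island', 'Staten Island', 'Queens'],
    'CMPLNT_NUM': [600, 700, 1100],
})

# Total and mean complaints per borough in a single aggregation
stats = toy.groupby('boro_name')['CMPLNT_NUM'].agg(['sum', 'mean']).round(2)
# Staten Island: sum 1300, mean 650.0; Queens: sum 1100, mean 1100.0
```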

Total Complaints Map

Bearing in mind the statistics above, it is now time to plot some maps and inspect the data visually. This total-complaints map is the culmination of this final part of the project, differentiating the areas of New York through the spatial distribution of the complaints.

In [109]:
# Neighborhood Tabulation Areas colored by total complaints


Map([Layer(nta_gdf,geom_col = 'geometry', style=color_bins_style('CMPLNT_NUM'), title='Total Complaints', 
          widgets = histogram_widget('CMPLNT_NUM', title = 'Histogram of Total Complaints'))], show_info=True)
Out[109]:

Maps filtered by Offenses

To finish, we put everything together to create some really interesting maps: the offense counts paint the background of each neighborhood, while a filtered sample of the matching complaints is drawn on top as points.

In this first map, we paint each neighborhood according to its number of Dangerous Drugs complaints, which we observe are most common in Manhattan and the Bronx, especially around their intersection, and in the center of Brooklyn. The points represent a sample of this crime committed by suspects aged 18 to 24, coloured by race.

In [110]:
# Neighborhood Tabulation Areas painted by count of Dangerous Drugs offenses. Points being suspects coloured by race.

viewport = {'zoom': 8.7, 'lat':40.690433, 'lng': -74.148701}

breaks = ['WHITE HISPANIC', 'WHITE', 'BLACK HISPANIC', 'BLACK']
# SUSP_RACE is categorical, so we use a category style rather than numeric bins
color_bins_st = color_category_style('SUSP_RACE', cat=breaks, palette='mint')
color_bins_leg = color_category_legend(title='Suspect Race', description='Different Races', footer='Color of the point')

Map([Layer(nta_gdf,geom_col = 'geometry', style=color_bins_style('OFNS_DESC_DANGEROUS DRUGS', palette='pinkyl'), 
           title='Dangerous Drugs Offense', 
           widgets = histogram_widget('OFNS_DESC_DANGEROUS DRUGS', title = 'Histogram of Dangerous Drugs')),
     Layer(complete_gpd[(complete_gpd.OFNS_DESC == 'DANGEROUS DRUGS') & (complete_gpd.SUSP_AGE_GROUP == '18-24') 
                        & ((complete_gpd.SUSP_RACE == 'WHITE HISPANIC') | (complete_gpd.SUSP_RACE == 'WHITE') | (complete_gpd.SUSP_RACE == 'BLACK') | (complete_gpd.SUSP_RACE == 'BLACK HISPANIC') )], 
           color_bins_st, color_bins_leg)], show_info=True, viewport=viewport)
Out[110]:
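The chained boolean conditions in the point layer above can be expressed more compactly with Series.isin; a pandas-only sketch on toy data:

```python
import pandas as pd

# Toy complaints table standing in for complete_gpd
toy = pd.DataFrame({
    'OFNS_DESC':      ['DANGEROUS DRUGS', 'DANGEROUS DRUGS', 'ROBBERY'],
    'SUSP_AGE_GROUP': ['18-24', '45-64', '18-24'],
    'SUSP_RACE':      ['WHITE', 'BLACK', 'WHITE'],
})

races = ['WHITE HISPANIC', 'WHITE', 'BLACK HISPANIC', 'BLACK']
mask = (
    (toy.OFNS_DESC == 'DANGEROUS DRUGS')
    & (toy.SUSP_AGE_GROUP == '18-24')
    & toy.SUSP_RACE.isin(races)   # replaces the four OR-ed equality checks
)
sample = toy[mask]
# Only the first row survives the filter
```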

In this second map, similarly to before, we paint each neighborhood according to its number of Harrassment 2 complaints (as the offense is spelled in the dataset), which we observe are somewhat more spread out: we see red areas in Manhattan, the Bronx, and Brooklyn, with some in Queens too. The points represent a sample of this crime committed by male suspects against underage female victims.

In [111]:
# Neighborhood Tabulation Areas painted by count of Harrassment 2 offenses. Points are male suspects with underage female victims 

viewport = {'zoom': 8.7, 'lat':40.690433, 'lng': -74.148701}

breaks = ['<18']
# VIC_AGE_GROUP is categorical, so we use a category style rather than numeric bins
color_bins_st = color_category_style('VIC_AGE_GROUP', cat=breaks, palette='mint')
color_bins_leg = color_category_legend(title='Victim Age Group', description='Different Age Ranges', footer='Color of the point')

Map([Layer(nta_gdf,geom_col = 'geometry', style=color_bins_style('OFNS_DESC_HARRASSMENT 2', palette='pinkyl'), 
           title='Harrassment 2 Offense', 
           widgets = histogram_widget('OFNS_DESC_HARRASSMENT 2', title = 'Histogram of Harrasment 2')),
     Layer(complete_gpd[(complete_gpd.OFNS_DESC == 'HARRASSMENT 2') & (complete_gpd.VIC_SEX == 'F') 
                        & (complete_gpd.SUSP_SEX == 'M') & (complete_gpd.VIC_AGE_GROUP == '<18')], 
           color_bins_st, color_bins_leg)], show_info=True, viewport=viewport)
Out[111]:

Finally, in this last map, we paint each neighborhood according to its number of Dangerous Weapons complaints, which we observe are concentrated in the Bronx and its intersection with Manhattan, though we find some red areas in the center of Brooklyn and Queens too. The points represent a sample of these complaints with victims aged 25 to 44, coloured by the suspect's sex.

In [112]:
# Neighborhood Tabulation Areas painted by count of DANGEROUS WEAPONS offenses. Points are suspects coloured by sex 

viewport = {'zoom': 8.7, 'lat':40.690433, 'lng': -74.148701}

breaks = ['F', 'M']
# SUSP_SEX is categorical, so we use a category style rather than numeric bins
color_bins_st = color_category_style('SUSP_SEX', cat=breaks, palette='mint')
color_bins_leg = color_category_legend(title='Suspect Sex', description='Different Sexes', footer='Color of the point')

Map([Layer(nta_gdf,geom_col = 'geometry', style=color_bins_style('OFNS_DESC_DANGEROUS WEAPONS', palette='pinkyl'), 
           title='Dangerous Weapons Offense', 
           widgets = histogram_widget('OFNS_DESC_DANGEROUS WEAPONS', title = 'Histogram of Dangerous Weapons')),
     Layer(complete_gpd[(complete_gpd.OFNS_DESC == 'DANGEROUS WEAPONS') & ((complete_gpd.SUSP_SEX == 'F') 
                        | (complete_gpd.SUSP_SEX == 'M')) & (complete_gpd.VIC_AGE_GROUP == '25-44')], 
           color_bins_st, color_bins_leg)], show_info=True, viewport=viewport)
Out[112]:

To recap these last maps: the neighborhoods at the intersection of Manhattan and the Bronx are usually painted red, meaning they top the complaint counts, followed by the rest of their respective boroughs. The central part of Brooklyn also tends to take the darkest colour of the scale, followed, most likely, by the central area of Queens. Staten Island, by contrast, usually records very few complaints.

In the third section of our report we go deeper into our analysis and conclusions.